RSS 2.0

Personal Info:

Joe Send mail to the author(s) is a lead architect on an OS incubation project at Microsoft, and was the architect for Parallel Extensions to .NET. He is an author and frequent speaker.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2010, Joe Duffy

 
 Sunday, September 21, 2008

The enumeration pattern in .NET unfortunately implies some overhead that makes it difficult to compete with ordinary for loops.  In fact, the difference between

T[] a = …;

for (int i = 0, c = a.Length; i < c; i++) …action(a[i])…;

and

T[] a = …;

IEnumerator<int> ae = ((IEnumerable<T>)a).GetEnumerator();

while (ae.MoveNext()) …action(ae.Current)…;

is about 3X.  That is, the former is 1/3rd the expense of the latter, in terms of raw enumeration overhead.  Clearly as action becomes more expensive the significance of this overhead lessens.  But if your plan is to invoke a small action over a large number of elements, using an enumerator instead of indexing directly into the array could in fact cause your algorithm to take 3X longer to finish.

There are many reasons for this problem.  They are probably obvious.  Using an enumerator requires at least two interface method calls just to extract a single element from the array.  Because there are O(length) number of these operations, the overhead imposed will be O(length) as well.  Contrast that with the nice, compact for loop, which emits ldarg IL instructions that access the array directly.  This will end up computing some offset (e.g., i * sizeof(T)) and dereferencing right into the array memory.  The enumerator needs to do that, of course, but only after the two interface calls are made.  Additionally, it is possible for the JIT compiler to omit the bounds check on the array access if it knows ‘c’ in the predicate ‘i < c’ was computed from ‘a.Length’, because arrays in .NET are immutable and their size cannot change.

(Strangely, it appears going through IList<T> is even slower than enumeration.  In fact, it appears to be more than 3X the cost of going through IList<T>’s enumerator, and over 10X that of indexing into the array using true ldarg instructions instead of interface calls to IList<T>’s element indexer.)

All of this actually makes it somewhat difficult for those on my team building PLINQ to compete with hand written programs.  That’s true of LINQ generally.  In fact, LINQ tends to be worse, because you string several enumerators together to form a query, often leading to even more overhead attributed to enumeration.  So you might reasonably wonder: if people care about performance, then why would they willingly start off 3X “in the hole” in hopes that they will eventually gain it back when they use machines with >= 4 cores?  It’s a completely fair criticism (although you must recall that everything I’m talking about is “pure overhead” and once you begin to have sizable computations in the per-element action it matters less and less).  We continually do a lot of work to try to recoup these costs.

There are actually many alternative enumeration models, and I think .NET needs to change direction in the future.  In addition to the overhead associated with the pattern, .NET’s enumeration pattern is a “pull” model (versus “push”), which makes it incredibly hard to tolerate blocking within calls to MoveNext.  Over time, I think we will need to pursue the push model more seriously.

I’ve thrown together a few different examples of alternative enumeration techniques.  To cut to the chase, here is a simple micro-benchmark test that enumerates over 1,000,000 elements 25 times, invoking an empty (non-inlineable) method for each element.  The per-element work here is quite small (although not empty) and so the results are a bit more extreme than a real workload would show:

For loop (int[])        739255 tcks     % of baseline

For loop (IList<int>)   7534609 tcks    1019.216%

ForEach loop (int[])    829617 tcks     112.2234%

int[] IEnumerator<int>  2152414 tcks    291.1599%

IEnumerator<int>        2062876 tcks    279.048%

IFastEnumerator<int>    1758992 tcks    237.9412%

IForEachable<int> [s]   1103745 tcks    149.305%

IForEachable<int> [i]   976742 tcks     132.1252%

IForEachable2<int>      957883 tcks     129.5741%

These are:

  • “For loop (int[])” is an ordinary for loop over the array directly.
  • “For loop (IList<int>)” is an ordinary for loop over the array’s IList<T> interface.
  • “ForEach loop (int[])” is an ordinary foreach loop over the array directly.
  • “int[] IEnumerator<int>” uses the array’s implementation of IEnumerator<T>.
  • “IEnumerator<int>” is a custom IEnumerator<T> implementation.
  • “IFastEnumerator<int>” is an implementation of new pull interface (defined below).
  • “IForEachable<int>” is an implementation of a new push interface (defined below) that uses delegates to represent the per-element action.  The only difference between the “[s]” and “[i]” variants are that the delegate is bound to a static method for “[s]” and an instance method for “[i]”.
  • “IForEachable2<int>” is a slight variant of IForEachable<T> (also defined below).

Notice that with IForEachable2<T>, we’ve gotten within 30% of the efficient for loop.  Unfortunately, I do get somewhat different numbers when compiling with the /o+ switch:

For loop (int[])        777746 tcks     % of baseline

For loop (IList<int>)   7569517 tcks    973.2634%

ForEach loop (int[])    735846 tcks     94.61264%

int[] IEnumerator<int>  2340361 tcks    300.9159%

IEnumerator<int>        2063039 tcks    265.2587%

IFastEnumerator<int>    1806568 tcks    232.2825%

IForEachable<int> [s]   1090644 tcks    140.2314%

IForEachable<int> [i]   946090 tcks     121.6451%

IForEachable2<int>      1234201 tcks    158.6895%

For comparison purposes, I get numbers like this if the loop body is completely empty except for accessing the current element:

For loop (int[])        452039 tcks     % of baseline

For loop (IList<int>)   422732 tcks     93.51671%

ForEach loop (int[])    461274 tcks     102.043%

int[] IEnumerator<int>  1958711 tcks    433.3058%

IEnumerator<int>        1730502 tcks    382.8214%

IFastEnumerator<int>    1372421 tcks    303.6068%

IForEachable<int> [s]   1091720 tcks    241.5101%

IForEachable<int> [i]   958401 tcks     212.0173%

IForEachable2<int>      664572 tcks     147.0165%

And this (with /o+):

For loop (int[])        262146 tcks     % of baseline

For loop (IList<int>)   263302 tcks     100.441%

ForEach loop (int[])    372924 tcks     142.2581%

int[] IEnumerator<int>  1889132 tcks    720.6412%

IEnumerator<int>        1635837 tcks    624.0175%

IFastEnumerator<int>    1479579 tcks    564.4103%

IForEachable<int> [s]   1096712 tcks    418.3592%

IForEachable<int> [i]   962261 tcks     367.0706%

IForEachable2<int>      698340 tcks     266.3935%

These numbers aren’t quite as meaningful because we have no idea what’s being optimized away by the C# and JIT compilers.  For example, they may notice we’re not using the current element at all and therefore eliminate the access altogether.  Nevertheless, the relative ranking of efficiency has remained nearly the same (with the notable exception of the array’s IList<T> test being much less worse).

(All of these numbers were gathered on a 32-bit OS on a 64-bit machine.  Because the JIT compilers for 32-bit and 64-bit are so different, you can expect vastly different results across architectures.)

Anyway, here is what IFastEnumerator<T>, IForEachable<T>, and IForEachable2<T> look like:

interface IFastEnumerable<T>

{

    IFastEnumerator<T> GetEnumerator();

}

 

interface IFastEnumerator<T>

{

    bool MoveNext(ref T elem);

}

 

interface IForEachable<T>

{

    void ForEach(Action<T> action);

}

 

interface IForEachable2<T>

{

    void ForEach(Functor<T> functor);

}

 

abstract class Functor<T>

{

    public abstract void Invoke(T t);

}

I also have a data type called SimpleList<T> that implements each of these, including IEnumerable<T>.  This is what the test harness uses for its benchmarking.  So any boneheaded mistakes I’ve made in the implementation of this class could cause us to draw the wrong conclusions about the interfaces themselves.  Hopefully there are none:

class SimpleList<T> :

    IEnumerable<T>, IFastEnumerable<T>, IForEachable<T>, IForEachable2<T>

{

    private T[] m_array;

 

    public SimpleList(T[] array) { m_array = array; }

 

       // Etc …

}

The class of course implements IEnumerable<T> in the standard way:

IEnumerator<T> IEnumerable<T>.GetEnumerator()

{

    return new ClassicEnumerable(m_array);

}

 

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()

{

    return new ClassicEnumerable(m_array);

}

 

class ClassicEnumerable : IEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal ClassicEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext() { return ++m_index < m_a.Length; }

    public T Current { get { return m_a[m_index]; } }

    object System.Collections.IEnumerator.Current { get { return Current; } }

    public void Reset() { m_index = -1; }

    public void Dispose() { }

}

The idea behind IFastEnumerable<T> (and specifically IFastEnumerator<T>) is to return the current element during the call to MoveNext itself.  This cuts the number of interface method calls necessary to enumerate a list in half.  The impact to performance isn’t huge, but it was enough to cut our overhead from about 3X to 2.3X.  Every little bit counts:

IFastEnumerator<T> IFastEnumerable<T>.GetEnumerator()

{

    return new FastEnumerable(m_array);

}

 

class FastEnumerable : IFastEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal FastEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext(ref T elem)

    {

        if (++m_index >= m_a.Length)

            return false;

        elem = m_a[m_index];

        return true;

    }

}

(Update: after writing the blog post, I made a couple slight optimizations that make this a bit tighter (fewer field fetches):

class FastEnumerable : IFastEnumerator<T>

{

    private T[] m_a;

    private int m_index = -1;

    internal FastEnumerable(T[] a) { m_a = a; }

 

    public bool MoveNext(ref T elem)

    {

        T[] a = m_a;

        int i;

        if ((i = ++m_index) >= a.Length)

            return false;

        elem = a[i];

        return true;

    }

}

The impact to performance isn't huge, but does improve the performance to about 2.1X of the baseline.)

The IForEachable<T> interface is a push model in the sense that the caller provides a delegate and the ForEach method is responsible for invoking it once per element in the collection.  ForEach doesn’t return until this is done.  In addition to having far fewer method calls to enumerate a collection, there isn’t a single interface method call.  Delegate dispatch is also much faster than interface method dispatch.  The result is nearly twice as fast as the classic IEnumerator<T> pattern (when /o+ isn’t defined).  Now we’re really getting somewhere!

void IForEachable<T>.ForEach(Action<T> action)

{

    T[] a = m_array;

    for (int i = 0, c = a.Length; i < c; i++)

        action(a[i]);

}

Delegate dispatch still isn’t quite the speed of virtual method dispatch.  And delegates bound to static methods are actually slightly slower than those bound to instance methods, which is why you’ll notice a slight difference in the original “[s]” versus “[i]” measurements.  The reason is subtle.  There is a delegate dispatch stub that is meant to call the target method: when the delegate refers to an instance method, the ‘this’ reference pushed in EAX points to the delegate object when it is invoked and the stub can simply replace it with the target object and jump; for static methods, however, all of the arguments need to be “shifted” downward, because there is no ‘this’ reference to be passed and therefore the first actual argument to the static method must take the place of the current value in EAX.

The IForEachable2<T> interface just replaces delegate calls with virtual method calls.  Somebody calling it will pass an instance of the Functor<T> class with the Invoke method overridden.  The implementation of ForEach then looks quite a bit like IForEachable<T>’s, just with virtual method calls in place of delegate calls:

void IForEachable2<T>.ForEach(Functor<T> functor)

{

    T[] a = m_array;

    for (int i = 0, c = a.Length; i < c; i++)

        functor.Invoke(a[i]);

}

And that’s it.  Here is the program that drives the little micro-benchmark tests that I showed output for at the beginning:

class Program

{

    public static void Main()

    {

        const int size = 2500000;

 

        Random r = new Random();

        int[] array = new int[size];

        for (int i = 0; i < size; i++) array[i] = r.Next();

 

        SimpleList<int> list = new SimpleList<int>(array);

 

        const int iters = 25;

 

        long baseline = 0;

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular for loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                for (int j = 0, c = array.Length; j < c; j++)

                    DoNothing(array[j]);

            }

            baseline = sw.ElapsedTicks;

            Console.WriteLine("For loop (int[])\t{0} tcks\t% of baseline", baseline);

        }

 

        // Regular for loop (IList<int>)

        {

            Stopwatch sw = Stopwatch.StartNew();

            IList<int> ia = array;

            for (int i = 0; i < iters; i++)

            {

                for (int j = 0, c = ia.Count; j < c; j++)

                    DoNothing(ia[j]);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("For loop (IList<int>)\t{0} tcks\t{1}%",

                elapsed, 100*(elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular foreach loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                foreach (int x in array)

                    DoNothing(x);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("ForEach loop (int[])\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // Regular foreach loop

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                IEnumerator<int> e = ((IEnumerable<int>)array).GetEnumerator();

                while (e.MoveNext())

                    DoNothing(e.Current);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("int[] IEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        // IEnumerator<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                IEnumerator<int> e = ((IEnumerable<int>)list).GetEnumerator();

                while (e.MoveNext())

                    DoNothing(e.Current);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IFastEnumerator<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                int x = 0;

                IFastEnumerator<int> e = ((IFastEnumerable<int>)list).GetEnumerator();

                while (e.MoveNext(ref x))

                    DoNothing(x);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IFastEnumerator<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                Action<int> act = new Action<int>(DoNothing);

                ((IForEachable<int>)list).ForEach(act);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable<int> [s]\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                DoNothingClosure dnc = new DoNothingClosure();

                Action<int> act = new Action<int>(dnc.DoNothing);

                ((IForEachable<int>)list).ForEach(act);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable<int> [i]\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

        GC.Collect();

        GC.WaitForPendingFinalizers();

 

        // IForEachable2<T>

        {

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < iters; i++)

            {

                DoNothingFunctor dnf = new DoNothingFunctor();

                ((IForEachable2<int>)list).ForEach(dnf);

            }

            long elapsed = sw.ElapsedTicks;

            Console.WriteLine("IForEachable2<int>\t{0} tcks\t{1}%",

                elapsed, 100 * (elapsed / (float)baseline));

        }

 

    }

 

    [System.Runtime.CompilerServices.MethodImpl(

        System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]

    private static void DoNothing(int x) { }

 

    class DoNothingClosure

    {

        [System.Runtime.CompilerServices.MethodImpl(

            System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]

        public void DoNothing(int x) { }

    }

 

    class DoNothingFunctor : Functor<int>

    {

        public override void Invoke(int x) { DoNothing(x); }

    }

}

To summarize, .NET enumeration costs something over typical for loops that index straight into arrays.  Most programs needn’t worry about these kinds of overheads.  If you’re accessing a database, manipulating a large complicated object, or what have you, inside of the individual iterations, then the overheads we’re talking about here are miniscule.  In fact, walking 1,000,000 elements is in the microsecond range for all of the benchmarks I showed, even the slowest ones.  So none of this is anything to lose sleep over.  But if you have a closed system that controls all of its enumeration, it may be worth doing some targeted replacement of enumerators with the more efficient patterns, particularly if you tend to enumerate lots and lots of elements lots and lots of times in your program.

9/21/2008 12:14:58 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [8]

 

Recent Entries:

Search:

Browse by Date:
<September 2008>
SunMonTueWedThuFriSat
31123456
78910111213
14151617181920
21222324252627
2829301234
567891011

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...