 Friday, April 12, 2013
I am naturally drawn to teams that work at an insane pace. The momentum, and persistent drive to increase that momentum, generates amazing results. And it's crazy fun.
In such environments, however, I've found one thing to be a constant struggle for everybody on the team -- leaders, managers, and individual doers alike: remembering to take the necessary time to do the right thing. This sounds obvious, but it's very easy to lose sight of this important principle when deadlines loom, customers and managers and shareholders demand, and the overall team is running ahead at a breakneck pace.
A nice phrase I learned from a past manager of mine was, "sometimes you need to slow down to speed up."
By taking shortcuts today, though attractive in that they help meet that next closest deadline, you almost always pay for them down the road. You might subsequently become quagmired in bugs because quality was compromised from the outset. You may create a platform that others build upon, only to realize later that the architecture is wrong and in need of revamping, incurring a ripple effect on an entire software stack. You may realize that your whole system performs poorly under load, such that just when your startup was beginning to skyrocket to success, users instead flee due to the poor experience. The manifestation differs, but the root cause is the same.
The level of quality you need for a project is very specific to your technology and business. I'll admit that working on systems software demands different quality standards than web software, for example. And the quality demands change as a project matures, when the focus shifts from writing reams of new code to modifying existing code... although the early phases are in fact the most challenging: this is when the most critical cultural traits are not yet set but are developing, when things have the highest risk of getting set off in the wrong direction, and when you are most likely to scrimp on quality due to the need to make rapid progress on a broad set of problems all at once.
So how do you ensure people end up doing the right thing? Well, I'd be lying if I didn't say it is a real challenge.
As a leader, it is important to create a culture where individuals get rewarded for doing the right thing. Nothing beats having a team full of folks that "self-police" themselves using a shared set of demanding principles.
To achieve this, leaders need to be consistent, demanding, and hyper-aware of what's going on around them. You need to be able to recognize quality versus junk, so that you can reward the right people. You need to set up a culture where critical feedback when shortcuts are being taken is "okay" and "expected." I've made my beliefs pretty evident in prior articles, but I simply don't believe you can do this right in the early days without being highly technical yourself. As a team grows, your attention to technical detail may get stretched thin, in which case you need to scale by adding new technical leaders that share, recognize, and maintain or advance these cultural traits.
You also can't punish people for getting less done than they could have if they took those shortcuts. Many cultures reward those who hammer out large quantities of poorly written code. You get what you reward.
In fact, you must do the opposite, by making an example out of the people who check in crappy code.
Facebook has this slogan "move fast and break things." It may seem that what I'm saying above is at odds with that famous slogan. Indeed they are somewhat contradictory, however paradoxically they are also highly complementary. Although you need to slow down to do the right thing, you do also need to keep moving fast. If that seems impossible, it's not; but it sure is difficult to find the right balance.
I have a belief that I'm almost embarrassed to admit: I believe that most people are incredibly lazy. I think most quality compromise stems from an inherent laziness that leads to details being glossed over, even if they are consciously recognized as needing attention. The best developers maintain this almost supernatural drive that comes from somewhere deep within, and they use this drive to stave off the laziness. If you're moving fast and writing a lot of code, strive to utilize every ounce of intellectual horsepower you can muster -- sustained, for the entire time you are writing code. Even if that's for 16 hours straight. If at any moment a thought occurs that might save you time down the road, stop, ponder it, course correct on the fly. This is a way of "slowing down to speed up" but in a way where you can still be moving fast. Many lazier people let these fleeting thoughts go without exploring them fully. They will consciously do the wrong thing because doing the right thing takes more time.
I've developed odd habits over the years. As a compile runs, I literally pore over every modified line of code, wondering if there's a better way to do it. If I see something, I push it on the stack and make sure to come back to it. By the time I've actually committed some new code -- regardless of whether it's 10,000 lines of freshly written code, or a 10 line modification to some existing stuff -- chances are that I've read each line of code at least three times. I disallow any detail I see to slip through the cracks. And my mind obsesses over all aspects of my work, even during "off times" (e.g., eating dinner, walking down the hallway, etc). Each of these opportunities represents a chance to slow down, reflect, and course correct.
Do I still miss things? Sure I do. But that's why it's so critical to have a team around you who shares the same principles and will help to identify any shortcomings that I've missed.
Another practice I encourage on my team is fixing broken windows. I'm sure folks are aware of the so-called broken windows theory, where neighborhoods in which broken windows are tolerated tend to accumulate more and more broken windows with time. It happens in code, too. If people are discouraged from stopping to fix the broken windows, you will end up with lots of them. And guess what, each broken window actually slows you down. As more and more accumulate, it can become a real chore to get anything meaningful done. I guarantee you will not be able to move very fast if too many broken windows pile up and start needing attention. Slowing down to fix them incrementally, as soon as they are noticed, speeds you up down the road.
Building a quality-focused team isn't easy. But creating a culture that slows down to do the right thing, while simultaneously moving fast, provides an enormous competitive advantage. It's not as common as you might think.
 Thursday, April 11, 2013
I mentioned a few months back that my team had collaborated with MSR to publish a paper to OOPSLA about some novel aspects of our programming language (see here and here).
I was excited when Jonathan over at InfoQ asked to interview me about this work. We had a fun back and forth, and I hope the result helps to clarify some of the design goals and decisions we made along the way.
You can check it out here: Uniqueness and Reference Immutability for Safe Parallelism.
 Saturday, March 16, 2013
It's really hard to build a great team. It can take years of hard work and an enormous amount of patience.
The reality is that there's only a finite (read: small) number of truly amazing software developers in the world, especially compared to the opportunities and exciting projects available to them.
And yet, great teams are fueled first and foremost by great people. I often liken this to the aphorism "a rising tide lifts all boats."
The original meaning of the phrase of course had nothing to do with software. It was the notion that focusing on growth of the overall economy's GDP will necessarily have a positive impact on the incomes of individuals within that economy. Now, of course, it's not always true, and I'm no theoretical economist, however the basic idea in spirit is an intuitively interesting one.
Applying this thinking to teams, it implies you should always strive to hire better and better people. That by doing so, the overall quality of the team will rise. Hiring better and better people has a nonlinear impact to the culture, because a team is not just a disjoint set of nodes, but is instead a fully connected graph of individuals who have conversations and collaborate together. A greater overall quality of the team means richer connections and more powerful, higher quality innovation and software. It means your chance of truly changing the world has grown nonlinearly as well.
I strive to only hire people who are better than me, and better than people already on the team, in some interesting dimension. As soon as you let your high standards drop even an ounce, the average drops and there is a cumulative snowballing effect. The connections grow weaker, and a nonlinear drop in quality and innovation will occur. This is my nightmare scenario because it can go downhill very quickly.
This applies to an entire company as well as individual teams. Including what can happen should the tides lower. The brain drain begins as a slow drip, and can quickly turn into a torrential downpour in an instant. It often starts from the top, because culture and hiring start from the top.
Now, I will be the first to admit that raising the tide is hard. Damn hard, in fact. I have another phrase which is "always be on." That incredible engineer you worked with ten years ago just might be the piece missing in the puzzle today, and a good way to lift the boats. Opportunities come and go when you least suspect them, and you want those people to want to join your team. I have several individuals that literally took years of effort to recruit. And the wait was well worth it. This advice applies to individual contributors as much as it does to managers. You never know if in a few years, you'll be leading a team, kicking off your own startup, or even just helping to make your own team a better place.
And as a leader you owe all of this to your existing team. By lifting the boats, your entire team benefits. They grow, learn new things, and reach new heights in their own careers.
Despite being hard work, this all pays off in the end. There is very little I find more satisfying in life than building and growing a great team, seeing the year over year improvements, and creating amazing things together. Perhaps even more than coding. (gasp)
 Sunday, March 03, 2013
The very notion of "authority" is 90% in your head. And it's one that often holds back otherwise very capable people.
This is yet another one that I got entirely wrong early in my own career.
When starting a new job, it's natural to be in what I call "understanding and assessment" mode. In fact, coming into a new job and telling everybody how they are doing things wrong is a recipe to not only get you off on the wrong foot, but also permanently poison your relationship with what would have been very important allies down the road. However, it's critical to turn the corner at some point before it's too late. The more experience you have, the less time it should take.
When I first came to Microsoft, I was suddenly surrounded by lots of smart people with momentum and energy on whatever it was that they were building. This led me to initially assume that these people knew what was going on. It led me to assume that, simply because some guy has the title of "Corporate Vice President" or "Distinguished Engineer", they knew what was going on. In fact, on my first day on the job -- and I will never forget this -- I was in a design meeting where I had the "audacity" to tell another DE (Microsoft's highest ranking technical position at the time) that I disagreed with what he was saying. I was polite about it, and to this day think I was right. However, I was pulled aside afterwards and told how stupid a move that was. And what's worse, I actually listened for my first two years and went out of my way not to rock the boat too much. This was against my better judgment. I was still young, and had come from another job where I was confident and could safely question anything; but I made the silly mistake of thinking "well, maybe things are different around here." I still wish I could get those two years back.
Allow me to let you in on a little secret. (Well, okay, it's not really a secret, but if only I could go back and tell my younger self this. And I suppose it ought to be obvious.) These people don't always know what is going on. It's probably safe to assume that these people have been rewarded in their careers because, statistically speaking, they are right more often than they are wrong. But it's still just statistics. And truthfully, if they are any good, they will like being questioned. They enjoy the technical debate. This is a critical aspect of a great team.
In fact, if they don't enjoy the debate, you are likely in the wrong place. The person who told me I was being stupid was actually right. The organization I had joined punished, rather than rewarded, people who questioned people in positions of authority. As soon as I realized this, I got out. It's a very personal preference, I suppose, but I personally prefer organizations that reward and promote people based on ability and direct impact. Sadly, in organizations where authority prevails, advancement is almost always based on who-likes-who, ass-kissing, and time-in-position. For folks looking for cozy jobs with guaranteed income, perhaps this is ideal; but you're quite unlikely to grow rapidly, build amazing things, and change the world in such places.
The employees I love the most are those that ask tons of questions and aren't afraid to tell me when I'm wrong. These people are inquisitive about everything, whether the topic is highly technical or pure business. By questioning my own views, and forcing me to articulate them, there is an overall strengthening of the culture. Not only do I benefit as a leader by having to methodically think through and defend my approach to problem solving, but those around me also benefit because (a) often they end up influencing the organization in big ways, and (b) even if my original stance survives, they understand the rationale behind certain decisions and can grow as a result. And it's fun! -- albeit passionately heated at times.
In fact, it's painful for me to see the opposite. An employee who has been trained to blindly respect authority. I was an admittedly rebellious youngster, so I've always known that this trait is simply not in my nature. I often reflect on how lucky I am to have landed in software rather than the military. Even when making those mistakes early in my career, I knew in my heart that they were mistakes at the time. But I do know that some people feel comfortable with a hierarchy of authority. They like the structure, and questioning it simply isn't an option. Sadly, some such people are beyond repair.
These days, I have to say that kids who grew up programming in their teens are the most fearless and rebellious. I have a dirty admission: I love hiring and mentoring and growing these individuals the most, because they have yet to be "trained" to respect authority. Once a vulnerable person early in their career has been brainwashed, it is an incredibly difficult thing to reverse. And so, as their career progresses, the longer these habits sit unaddressed, the less salvageable they become. Thankfully I caught it early on. I've managed plenty of folks who didn't.
Now, you can't be an asshole about it. And you can't be arrogant. Software is all about people and collaboration, and all of this questioning must be done with a single goal in mind: to make the software, the organization, and/or its people better. Authority is there for a reason, which is that ultimately someone needs to run the business, make decisions, and have their butt on the line. Sometimes the simple reality is that a leader's intuition is extremely good, and though data may be lacking to support the decision, you can trust it. It's okay to agree to disagree, or sometimes admit that someone simply has a stronger background than you in a particular area and so maybe you aren't in a position to fully understand why a certain decision was made. I have always tried to turn such occasions into learning opportunities. Jot down a few notes, and go read about it afterwards. I always jot down and research any term I hear that I'm not totally on top of, technical or business. It happens all the time.
In your next job interview, go out of your way to question a thing or two. If the person on the other end acts offended, either you asked in the wrong way (remember: respectful but inquisitive), or you shouldn't take the job. If it's a startup, read the business plan ahead of time and come prepared with some hard questions. If it's a corporation, ask what is rewarded, find holes in the technical architecture, question areas of the engineering process that could be improved. I really do think this is one of the most important cultural traits of a well-run team. And I guarantee you'll have way more fun on such a team, perhaps the most important cultural trait of all.
 Sunday, February 17, 2013
The best people in software have an innate ability to communicate using code. They have an idea and simply code it up, thereby making it reality. In fact, the best people are, I would say, obsessed with code.
Pick somebody in software that has done great things. Bill Gates comes to mind for me, because that’s who inspired me to get started in software. He wrote code for as long as he could manage, and famously delivered code reviews even as his company grew to 1,000s of engineers. No matter who you pick, I am sure one thing rings true: they obsessed over the details. And when it comes to software, those details are in the code.
Those who cannot read and write code must spend all of their time convincing other people of their ideas, and are usually sufficiently disconnected from reality (i.e., the code) that their ideas do not work in practice. This is an awful situation to be in, particularly at a company whose primary asset is code. Worse, most people voluntarily place themselves into this category, particularly over time in their careers, because they believe that coding is "not one of their job responsibilities." What rubbish!
I have three particular pet peeve examples to give.
The first is what I like to call the "mediocre mid-level manager syndrome." I’ll admit that when you manage large enough teams, you have to give up on a bit of coding. I will personally never give it up entirely, even as I manage teams of 1,000s of engineers. I will always use the product my team is building, I will read the checkins to at least understand what’s going on and stay grounded, and, assuming I continue to manage groups building development platforms, I will write code using that platform. But for managers who assume responsibility for 10 or fewer engineers, there is absolutely no excuse for slacking in these areas. It’s just pure laziness, and the teams suffer enormously: such teams typically lack "adult supervision" in the area of engineering culture, lack a role model, and build wrong and crappy things. In short, such managers literally add negative value. I can’t tell you how many "Software Development Leads" at large corporations fall into this category. That this is often culturally accepted is totally broken; needless to say, it is not acceptable on my teams.
The second is something I call "code is beneath me." The two most prominent examples are folks late in their careers and researchers. The former often goes hand in hand with the mid-level manager problem. But I’ve seen it afflict software engineers too: "I’ve been a professional developer for 10 years, so my job is now to tell others what to do rather than doing anything myself." At this point, they might adopt the title Architect. The research issue, however, quite frankly perplexes me. Computer Science is an odd mixture of pure math and applied engineering. I get that many CS researchers are more math-oriented, and wish to basically do mathematics rather than software. I also get that much of this research bears fruit. But in my experience there is a very large contingent of researchers that do not produce "first rate mathematics," and yet resist becoming grounded in code. The idea that you can improve the state of software, whose bloodline is code, without ever writing a line of it or becoming proficient in it, is complete insanity. And yet it’s generally accepted.
The last example, which is close to home since I made this mistake myself for a couple years early in my career, is "I manage things, I don’t build them." The title of Program Manager is a specific manifestation of this problem. Most have backgrounds in CS, and have probably written a little code. But most PMs are also usually not very good at it, don’t love it, and probably haven’t written much since leaving school. And yet these people are often "in charge" of making decisions about features, prioritization, and competitive offerings. It’s true that some people have great intuition and can make some good decisions without knowing how things work. But when it comes to software, those abilities need to be grounded by the code. If that’s not interesting to them, I always encourage considering positions in sales and marketing, HR, or one of the other organizations in such companies that isn’t focused on actually building the software. I would literally abolish the PM position at my company if I could. Those who love code should become developers.
I have the utmost respect for people who have fallen into any of these traps, but then realize it and get out. Hey, I did so myself.
Even those that write code often don’t do it enough. I’ve seen so many fall into the trap of debating whether or not something would work, or how elegant it would be. Certain people are afraid of failure, or find it difficult to get motivated to "start" coding. The best people, however, realize that questions are easily answered by writing the code in prototype form. They go from 0-60 in an instant, having a vision of what they would like to build, and letting nothing get in the way. I call this "oozing code from your fingertips." I do think some of this is a skillset thing. In software, the top 20% are easily 50X more productive than the bottom 20%. But I also think these traits can be learned, given role models who exhibit and demonstrate the behavior.
Finally, I do encourage software leaders to read as much code as they can. Reading code is a great way to learn how things work, and to stay on top of what’s actually happening in your project. And it keeps your mind fresh, and often leads to new ideas.
If it isn’t obvious, I might have a slightly atypical bias here. But it’s one of the things I am most passionate about with respect to running software teams. Code speaks. Love the code.
I’ve been managing software teams for several years. Perhaps more importantly, I have worked for some excellent leaders and have had the opportunity to learn from their good (and bad) habits.
Because I haven’t written a line of .NET code in a few years now, that blogging well has kind of run dry. And sadly my team is not yet ready to openly share our platform externally, so I cannot blog about that either.
As a result, I thought it would be fun to start a series about leadership in software. Not just the kind of leadership expected of managers, but also individual developers and architects. I have no idea how frequently I’ll write something, but just having a continuum of content to contribute to when I have a spare moment will help liven this place back up again, I’m sure. Furthermore, one lesson that’s been imparted upon me over the years is that "writing is thinking"; so by writing this stuff down, I’m sure it will crystallize even further.
The series will be called "Software Leadership" because, after all, it’s about the software. I hope you enjoy.
 Saturday, December 08, 2012
I mentioned recently that a paper from my team appeared at OOPSLA in October:
Uniqueness and Reference Immutability for Safe Parallelism (ACM, MSR Tech Report [PDF])
It's refreshing that we were able to release it. Our project only occasionally gets a public shout-out, usually when something leaks by accident. But this time it was intentional.
I began the language work described about 5 years ago, and it's taken several turns of the crank to get to a good point. (Hint: several more than even what you see in the paper.) Given the novel proof work in collaboration with our intern, folks in MSR, and a visiting professor expert in the area, however, it seemed like a good checkpoint that would be sufficiently interesting to release to the public. Perhaps some day Microsoft's development community will get to try it out in earnest.
There seems to have been some confusion over the goals of this work. I wanted to take a moment to clear the air.
First, despite assertions elsewhere, the primary focus of this work was not "implicit parallelism." Instead, I would summarize our goals as:
- Create a single language that incorporates the best of functional and imperative programming. Both offer distinct advantages, and we envisioned marrying the two.
- Codify and statically enforce common shared-memory state patterns, such as immutability and isolation, with minimal runtime overhead (i.e., virtually none).
- Provide a language model atop which can be built provably safe data, task, and actor-oriented concurrency abstractions. Both implicit and explicit. This includes, but is not limited to, parallelism.
- Do all of this while still offering industry-leading code quality and performance, rivaling that of systems-level C programs. Yet still with the safety implied by the abovementioned goals.
The language features in the paper are a vast subset of the full suite needed to achieve our overall project goals. However, these alone have exceeded our original expectations.
I've programmed a great deal in functional languages. I'm a long-time lover of LISP and ML, and my closest friends know about my hard-core dedication to Haskell (expressed in an admittedly odd manner). In fact, Haskell's elegant marriage of pure functional programming with monads, notably the state monad, was a major inspiration for the design of the type system. There are of course many other influences, such as regions, linear types, affine types, etc.; however, I'd say Haskell was the strongest.
In some sense, we have simply taken the reverse angle of Haskell with its monads: what would it be like to embed pure functional programming within an otherwise imperative language?
This first goal is proving to be my fondest aspect of the language. The ability to have "pockets of imperative mutability," familiar to programmers with C, C++, C#, and Java backgrounds, connected by a "functional tissue," is not only clarifying, but works quite well in practice for building large and complex concurrent systems. It turns out many systems follow this model. Concurrent Haskell shares this high-level architecture, as does Erlang. Well-written C# systems do the same, though the language doesn't (yet) help you to get it right.
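To make the "pockets of imperative mutability" idea a little more concrete, here is a minimal sketch in plain C#, not in the language described in the paper. The `Histogram` class and its `Count` method are hypothetical names invented for illustration. The method behaves like a pure function from the caller's point of view, while freely mutating a local dictionary inside. Nothing here is checked by the C# compiler; it is exactly the kind of convention that a type system tracking mutability could verify.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: plain C#, with no compiler-enforced purity.
public static class Histogram
{
    // Observably pure by convention: the result depends only on 'text',
    // and 'text' (an immutable string) is never modified.
    public static IReadOnlyDictionary<char, int> Count(string text)
    {
        // The "pocket of mutability": a local that is mutated freely,
        // but is not shared with anyone while it is being built.
        var counts = new Dictionary<char, int>();
        foreach (char c in text)
        {
            int n;
            counts.TryGetValue(c, out n); // n stays 0 when c is not yet present
            counts[c] = n + 1;
        }

        // Handed out through a read-only interface. A caller could still
        // downcast to Dictionary<char, int>; C# does not prevent that.
        return counts;
    }
}
```

Because each call owns its dictionary, independent calls to `Count` can run on different threads without sharing mutable state, which is the property the "functional tissue" is meant to preserve.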
Of course, as called out by the second goal, immutability and controlled side-effects are tremendously useful features on their own. Novel optimizations abound.
And it helps programmers declare and verify their intent. As mentioned in the paper, we have found/prevented many significant bugs this way. Did you ever want to verify that your contracts and assertions are pure, such that conditional compilation doesn't change the outcome of your program? Or that your sort comparator isn't mutating the elements while performing its comparisons? Neither has much to do with concurrency, although the latter facilitates parallel sorts. Many other systems introduce specific verification techniques to address specific problems, rather than employing a general purpose type system.
I would say the strength with respect to concurrency is not the type system itself, but rather what you can do with it.
The focus on implicit parallelism in the recent forum discussions was unfortunate. I guess "implicit parallelism" just makes for catchy and controversial titles. Yes, the type system makes implicit parallelism "safe and possible," some forms of which are indeed profitable, but it's not as though suddenly all of your for loops are going to run 8-times faster after a recompile. The optimization angle is an orthogonal, but very real, concern. There are decades of research and experience here.
Even when tasks are explicitly spawned, however, the fact that the type system catches unsafe mutable state capture that would lead to race conditions is, I dare say, game changing. I could never go back to the old model of instruction-level races, which nowadays feels like programming a PDP-6 to me (no insults implied). And yes, data parallel works great in this model. It may take a bit of imagination, rereading the article, and perhaps looking at related work such as Deterministic Parallel Java, to understand how, but it does.
The effort grew out of my work on Software Transactional Memory in 2004, then Parallel Extensions (TPL and PLINQ), and then my book, a few years later. I had grown frustrated that our programming languages didn't help us write correct concurrent code. Instead, these systems simply keep offering more and more unsafe building blocks and synchronization primitives. Although I admit to contributing to the mess, it continues to this day. How many flavors of tasks and blocking queues does the world need? I was also dismayed by the oft-cited "functional programming cures everything" mantra, which clearly isn't true: most languages, Haskell aside, still offer mutability. And few of them track said mutability in a way that is visible to the type system (Haskell, again, being the exception). This means that races are still omnipresent, and thus concurrent programs expensive and error prone to write and maintain.
Reflecting back, I am somewhat amazed that the language has taken so long to hatch. Type systems that are sound and strike the right balance of utility and approachability are hard work!
I am ecstatic that we've been able to make inroads towards solving these hard problems. My team is, quite simply, an amazing group of people, and without them the ideas would have never made it beyond the "that will never work" phase. I look forward to sharing more about our work in the years to come.
 Tuesday, October 30, 2012
.NET holds an enormous advantage over C++.
Well, okay, there are a few, but I’m thinking about one in particular: A single string type.
What’s not to love about that? Anybody who has done more than an hour’s worth of Windows programming in C++ should appreciate this feature. No more zero-terminated char* vs. length-prefixed char* vs. BSTR vs. wchar_t vs. CStringA vs. CStringW vs. CComBSTR. Just System.String. Hurray!
There’s one very specific thing not to love, however: The ease with which you can allocate a new one.
I’ve been working in an environment where performance is critical, and everything is managed code, for several years now. That might sound like an oxymoron, but our system can in fact beat the pants off all the popular native programming environments. The key to success? Thought and discipline.
We, in fact, love our single string type. And yet our team has learned (the hard way) that string allocations, while seemingly innocuous and small, spell certain death.
It may seem strange to pick on string. There are dozens of other objects you might allocate, like custom data types, arrays, lists, and whatnot. But there tend to be many core infrastructural pieces that deal with string manipulation, and if you build atop the wrong abstractions then things are sure to go wrong.
Imagine a web stack. It’s all about string parsing and processing. And anything to do with distributed processing of data is most likely going to involve strings at some level. Etc.
There are landmine APIs lurking out there, like String.Split and String.Substring. Even if you’ve got an interned string in hand (often rare in a server environment where strings are built from dynamically produced data), using these APIs will allocate boatloads of tiny little strings. And boatloads of tiny little strings means collections.
For example, imagine I just want to perform some action for each substring in a comma-delimited string. I could of course write it as follows:
string str = ...;
string[] substrs = str.Split(',');
foreach (string substr in substrs) {
    Process(substr);
}
Or I could write it as follows:
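string str = ...;
int lastIndex = 0;
int commaIndex;
// Search forward from the start of the current field.
while ((commaIndex = str.IndexOf(',', lastIndex)) != -1) {
    Process(str, lastIndex, commaIndex);
    lastIndex = commaIndex + 1;
}
// Handle the final field, which has no trailing comma.
Process(str, lastIndex, str.Length);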
The latter certainly requires a bit more thought. That’s primarily because .NET doesn’t have an efficient notion of substring – creating one requires an allocation. But the performance difference is night and day. The first one allocates an array and individual substrings, whereas the second performs no allocations. If this is, say, parsing HTTP headers on a heavily loaded server, you bet it’s going to make a noticeable difference.
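To make the index-based approach a little more reusable, here is a minimal sketch of a helper that walks the fields and hands each one to a callback as a (string, start, end) range. The names FieldScanner and ForEachField are made up for illustration, and the end index is assumed to be exclusive. The loop creates no substrings; the only allocation is whatever the caller's delegate costs, which is paid once rather than once per field.

```csharp
using System;

public static class FieldScanner
{
    // Invokes 'process' once per comma-delimited field, passing the original
    // string and the [start, end) range of that field. Returns the field count.
    public static int ForEachField(string str, Action<string, int, int> process)
    {
        int count = 0;
        int start = 0;
        int comma;
        while ((comma = str.IndexOf(',', start)) != -1)
        {
            process(str, start, comma);
            start = comma + 1;
            count++;
        }

        // The last field has no trailing comma (and may be empty).
        process(str, start, str.Length);
        return count + 1;
    }
}
```

Like String.Split, this treats an empty input as a single empty field, so callers that care can skip ranges where start equals end.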
Honestly, I’ve witnessed programs that should be I/O bound turn into programs that are compute-bound, simply due to use of inefficient string parsing routines across enormous amounts of data. (Okay, the developers also did other sloppy allocation-heavy things, but string certainly contributed.) Remember, many managed programs must compete with C++, where developers are accustomed to being more thoughtful about allocations in the context of parsing. Mainly because it’s such a pain in the ass to manage ad-hoc allocation lifetimes, versus in-place or stack-based parsing where it’s trivial.
"But gen0 collections are free," you might say. Sure, they are cheaper than gen1 and gen2 collections, but they are most certainly not free. Each collection is a linked list traversal that executes a nontrivial number of instructions and trashes your cache. It’s true that generational collectors minimize the pain, but they do not completely eliminate it. This, I think, is one of the biggest fallacies that plagues managed code to this day. Developers who treat the GC like their zero-cost scratch pad end up creating abstractions that poison the well for everybody.
Crank up .NET’s XmlReader and profile loading a modest XML document. You’ll be surprised to see that allocations during parsing add up to approximately 4X the document’s size. Many of these are strings. How did we end up in such a place? Presumably because whoever wrote these abstractions fell prey to the fallacy that "gen0 collections are free." But also because layers upon layers of such things lie beneath.
It doesn’t have to be this way. String does, after all, have an indexer. And it’s type-safe! So in-place parsing at least won’t lead to buffer overruns. Sadly, I have concluded that few people, at least in the context of .NET, will write efficient string parsing code. The whole platform is written to assume that strings are available, and does not have an efficient representation of a transient substring. And of course the APIs have been designed to coax you into making copy after copy, rather than doing efficient text manipulation in place. Hell, even the HTTP and ASP.NET web stacks are rife with such inefficiencies.
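As a small illustration of what the indexer buys you, here is a sketch of parsing a non-negative integer straight out of a range of a larger string, with no Substring call in sight. InPlaceParser and TryParseInt32 are hypothetical names, and the range convention ([start, end), end exclusive) is an assumption of the example.

```csharp
using System;

public static class InPlaceParser
{
    // Parses the non-negative decimal integer in str[start..end) using only the
    // string indexer. Returns false for an empty range, a non-digit, or overflow.
    public static bool TryParseInt32(string str, int start, int end, out int value)
    {
        value = 0;
        if (start >= end) return false;

        long acc = 0;
        for (int i = start; i < end; i++)
        {
            char c = str[i]; // bounds-checked, so no buffer overruns
            if (c < '0' || c > '9') return false;
            acc = acc * 10 + (c - '0');
            if (acc > int.MaxValue) return false;
        }

        value = (int)acc;
        return true;
    }
}
```

Combined with an IndexOf-based scan, something like this lets a header value be read as a number without ever materializing the intermediate string.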
In certain arenas, doing all of this efficiently actually pays the bills. In other arenas, it doesn’t, and I suppose it’s possible to ignore all of this and let the GC chew up 30% or more of your program’s execution time without anybody noticing. I’m baffled that such software is written, but at the same time I realize that my expectations are out of whack with respect to common practice.
The moral of the story? Love your single string type. It’s a wonderful thing. But always remember: An allocation is an allocation; make sure you can afford it. Gen0 collections aren’t free, and software written to assume they are is easily detectable. String.Split allocates an array and a substring for each element within; there’s almost always a better way.
 Sunday, October 28, 2012
A glimpse of some research we've done recently just appeared at OOPSLA last week:
Uniqueness and Reference Immutability for Safe Parallelism
A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system's flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.
The official ACM page is here, and a tech report version is available on MSR's website.
As I said, this is just a glimpse. Its focus was mainly on the type soundness work we've done jointly with MSR, and less about the language, syntax, and uses. You'll have to use your imagination to fill in the rest.
 Tuesday, July 17, 2012
It's been quite some time since I blogged.
The reason is simple, as always: I'm having the time of my life at work.
I will try to do better blogwise in the coming months. But in the meantime ...
If you'd like to join me, Adam, and Krzysztof in our quest to build the best APIs and developer platform known to mankind, please shoot me an email with your resume. Or just apply.
In short, you could be having the time of your life too. Why wait?
 Saturday, November 12, 2011
I often wish that .NET had erred on the side of offering postmortem instead of premortem finalization.
The distinction here is when exactly the finalizer runs, i.e. after or before the GC has actually reclaimed an object. This governs whether a dying object is (a) accessible from within its own finalizer, and therefore (b) eligible to become resurrected. Postmortem finalization occurs after the object is long gone, and hence says “no” to both of these questions; premortem finalization happens beforehand and hence says “yes.”
.NET chose the latter.
The primary downside of premortem finalization, setting aside the confusing nature of resurrection, is that the object in question cannot be collected until after its finalizer has run. This should be fairly obvious: it is only that second time the object is found to be dead “again” that we know the finalizer has or has not resurrected it.
This may seem like a small matter. But it matters quite a lot when building high performance software. In a garbage collected system, relying on high rates of finalization to keep up with demanding workloads almost never works. But in a premortem finalization system, even moderate demands become cause for concern.
Premortem finalization leads to finalized objects getting promoted to the elder generations before actually dying. If you check the value of GC.GetGeneration(this) within an object’s finalizer, for example, you will notice it is one greater than the generation in which the object was found to be dead the first time. Say it was found dead in Gen1; then GC.GetGeneration(this) will return ‘2’. Yet another collection must happen, in Gen2 to boot, in order to actually reclaim this object. And, of course, it’s not just this object, but also the transitive closure of objects to which it refers.
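If you want to see the promotion for yourself, an experiment along these lines should do. This is only a sketch: GenerationProbe and its members are made-up names, and what it reports depends on the runtime, build flavor, and GC configuration, so treat the number as an observation rather than a guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;

public sealed class GenerationProbe
{
    // Generation observed from inside the finalizer; stays -1 if it never ran.
    public static volatile int ObservedGeneration = -1;

    ~GenerationProbe()
    {
        // 'this' is still reachable here, which is the premortem part.
        ObservedGeneration = GC.GetGeneration(this);
    }

    // Allocate in a separate, non-inlined method so that no local variable
    // in the caller keeps the object alive.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void AllocateAndDrop()
    {
        new GenerationProbe();
    }

    public static int Run()
    {
        AllocateAndDrop();
        GC.Collect();                  // object is found dead and queued for finalization
        GC.WaitForPendingFinalizers(); // finalizer runs on the surviving object
        return ObservedGeneration;
    }
}
```

Calling GenerationProbe.Run() right after startup, when the object is still young, is the interesting case: the value it returns is the generation the object was promoted into, not the one it died in.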
This approach penalizes the majority use case of finalizable objects. At least on .NET, most objects merely invoke CloseHandle on an IntPtr in the finalizer. This clearly needn’t hold up freeing the managed state. And resurrection is a dubious scenario anyway: such objects quickly end up in Gen2 where collections are expensive and infrequent. If you’re pooling via resurrection because you create expensive objects at a high rate of birth and death, manual memory management (or a different design altogether) is likely your only savior.
Although Java’s finalizers are also premortem, the JDK offers the facilities necessary to implement postmortem finalization on your own. It entails using WeakReference and ReferenceQueue. See this article if you are curious.
.NET doesn’t offer the notifications required to do the same. You can, however, learn from postmortem finalization to write better premortem finalizers: prefer simple finalizable objects that refer to only the state necessary to implement finalization – which ordinarily means no other managed objects. The SafeHandle abstraction is a good example of this. Most implementations are comprised of a simple IntPtr. This pattern will ensure that collateral promotion due to finalization is more contained.
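Here is a rough sketch of that shape. The type names (UnmanagedBlockHandle, UnmanagedBuffer) are invented for the example, and unmanaged memory from Marshal.AllocHGlobal stands in for whatever native resource you actually hold. The point is only that the finalizable object carries an IntPtr and nothing else, while the richer wrapper has no finalizer of its own.

```csharp
using System;
using System.Runtime.InteropServices;

// The only finalizable object: it owns an IntPtr and no managed references.
public sealed class UnmanagedBlockHandle : SafeHandle
{
    public UnmanagedBlockHandle(int byteCount)
        : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

// The wrapper holds managed state but is not itself finalizable, so that
// state is not promoted just because the handle needs finalization.
public sealed class UnmanagedBuffer : IDisposable
{
    private readonly UnmanagedBlockHandle _handle;
    private readonly string _name;

    public UnmanagedBuffer(string name, int byteCount)
    {
        _name = name;
        _handle = new UnmanagedBlockHandle(byteCount);
    }

    public string Name { get { return _name; } }
    public bool IsClosed { get { return _handle.IsClosed; } }

    public void Dispose()
    {
        _handle.Dispose();
    }
}
```

If an UnmanagedBuffer is dropped without being disposed, only the small handle object has to wait around for its finalizer; the wrapper and its name can be reclaimed right away.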
After saying all of this, I hope it is just amusing trivia. I'm sure nobody is writing finalizers these days anyway.
 Sunday, October 23, 2011
It's been unbelievably long since I last blogged.
The reason is simple. I've been ecstatic in my job and, every time I think to write something, I quickly end up turning to work and soon find that hours (days? months?) have passed. This is a wonderful problem to have, but not so good for keeping the blog looking fresh and new. (I've also been writing a fair bit of music lately.) Well, this weekend I managed to lock myself out of my VPN access, and decided that this was a sign that I ought to dust off the cobwebs on a blog entry or two that I've had in the works for quite some time.
The topic for today is generics, a feature many of us know and love. Specifically, their impact on software performance, something I frequently see developers struggling to understand and tame in the wild.
The blessing; and, the curse
I absolutely love generics. I can hardly imagine writing code without them these days. The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety enabled by first class parametric polymorphism are all game-changing. And being a language history wonk, I'm delighted to see many mainstream programming languages stealing a page from ML and theoretical CS generally.
Generics, however, are not free. And in some circumstances, they are, dare I say, rather expensive. Few language features surpass generics in the ability to write a concise and elegant bit of code, which then translates into reams of ugly assembly code out the rear end of the compiler. I am of course speaking mainly to models in which compilation leads to code specialization (like .NET's), versus erasure (like Java's).
Most developers coming from a C++ background understand code expansion deeply, because they program with templates. Unlike templates, however, there is ample runtime type information (RTTI) associated with generic instantiations… such that the costs associated with generics frequently – and perhaps surprisingly – are a superset of those costs normally associated with C++ templates. At the same time, because the compiler understands parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques like code sharing.
Basically, with templates and erasure, the equation for predicting code expansion is super simple. You get it all (in the former) or you get none of it (in the latter), but with specialized generics this equation is quite complex.
Paradoxically, these same costs are the main value that generics bring to the table! Write a little type-agnostic code and then "instantiate" that same code over multiple types without repeating yourself. But, generics are not magic; did you ever stop to wonder things like: What machine code is generated for these types? Does the compiler need to specialize the actual code that runs on the processor for unique instantiations, or is it all the same? And if it does need to specialize, where, how, and why? And perhaps most importantly, what hidden costs are there, and how should I think about them while writing code?
Before reading further, paranoia need not ensue. The point of this article is merely to raise your awareness. All programmers should know what the abstractions they use cost, and make conscious tradeoffs when writing code with them. The aforementioned benefits of generics really are often "worth it," both in the elegance and reusability of abstractions, and in developer productivity. In my experience, however, the associated costs are so subtle and ill-documented that even people who write highly generic code typically remain unaware of them. Even more subtly, these costs are somewhat different in nature when pre-compiling your code, such as with .NET's NGen technology.
This brief essay will walk through a few such costs in the context of the .NET Framework and CLR's implementation of generics. This is in no way an exhaustive study of generic compilation, and your mileage will vary from one platform to the next. Although the studies presented would apply to other implementations of generics, the reality is that if you're writing code in, say, Java – where type erasure is employed rather than code specialization – then all of this is going to be less relevant to you.
With no further delay, let's get started.
Code, RTTI, oh my
When considering costs, we must always think about both size and speed.
There is at least as much assembly code created for an instantiation as the code you've written for the generic abstraction in C# or MSIL. A simple mental model – which, thankfully, turns out not to be entirely accurate, owing to some sharing optimizations described below – is that for each instantiation of a generic type or method you get a new copy of that code specialized to the type in question. Obviously, this increases code size. And just as obviously, it adds some runtime cost to JIT compile the code (if you aren't using ahead of time compilation), and puts more pressure on the I-cache and TLB.
Another source of significant cost is the runtime data structures needed for RTTI and Reflection, like vtables and other metadata. Quite simply, the runtime needs to know the identity of each generic instantiation, to prevent things like casting a List<Int16> to a List<String>, and even List<Object> to List<String>; and given that there is often distinct code generated for unique instantiations, the vtable contents for those different List<T> instantiations are going to look quite different.
And of course, there are statics. Each generic instantiation gets its own set, requiring extra storage and another level of indirection when fetching them. Unique statics means D-cache and TLB pressure. It turns out that code shared across AppDomains, like mscorlib.dll, already needs such things. But I have found that it's surprisingly common for a developer to throw a static field (or nested class!) onto an outer generic type, without actually needing it to be replicated for each unique instantiation.
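A tiny example makes the per-instantiation statics concrete. Registry and RegistryStats are hypothetical names; the sketch just contrasts a static declared on the generic type with one hoisted onto a non-generic holder.

```csharp
using System;

public class Registry<T>
{
    // One copy of this field exists for every distinct T.
    public static int InstancesCreated;

    public Registry()
    {
        InstancesCreated++;
        RegistryStats.TotalCreated++;
    }
}

// State that does not depend on T lives on a non-generic type,
// so there is exactly one copy no matter how many instantiations exist.
public static class RegistryStats
{
    public static int TotalCreated;
}
```

After creating two Registry<int> objects and one Registry<string>, Registry<int>.InstancesCreated is 2, Registry<string>.InstancesCreated is 1, and RegistryStats.TotalCreated is 3. If you never needed the per-T counts, the generic static was pure overhead.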
In addition to the immediate effects, generic types often refer to other generic types which refer to other generic types … and so on. Instantiating a root type is akin to instantiating the full transitive closure.
To make our discussion friendly and familiar, we shall use the .NET Framework's List<T> type – presumably one of the most commonly used generic types on the planet – to illustrate many of these costs. And unfortunately, you'll also see that many of the common performance pitfalls plague this type too. (So, really, you need not feel bad if your own code is guilty of them too.)
Why the distinct code, anyway?
There is only one copy of List<T>'s code in mscorlib's MSIL. It is essentially just a blueprint for the list class.
When I create a List<Int16> in my program and use it, however, there clearly needs to be some assembly code created in order to execute List<T>'s associated functionality, just with any T's used by List<T>'s code replaced by actual 2-byte short integers. And similarly, if I were to instantiate a List<String>, all those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes depending on machine architecture, that are reported live to the garbage collector.
This is what leads to our simple mental model above, in which each instantiation gets its own copy of the code. In this case, both List<Int16> and List<String> would be entirely independent types at runtime, with wholly separate copies of the machine code.
Certainly if I manually went about creating my own Int16List and StringList types, they would be distinct types with distinct machine code generated. Being a prudent developer, however, I'd probably try to arrange to share as much of the implementation as possible between the two types, perhaps using implementation inheritance. But alas, there's no way I could share it all: any code specific to Int16 or String, for example, would surely differ, both in MSIL and in the native code.
Generics basically give you the ability to do this same thing, without you needing to do the factoring of type-independent and type-specific code yourself. The compiler does that for you.
Why might the code be different? As stated above, Int16 values are 2 bytes and String pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code that passes values of type T on the stack, either as arguments or return values, moves instances into and out of memory locations (like the T[] backing array), and so on, needs to be specialized based on the size of T. This wouldn't be true of a generics implementation that used type erasure, like Java's, but then you'd need to box the value types on the heap so that everything is a pointer. If T is a Single (float), we will likely emit code that uses floating point registers and instructions instead of general purpose ones. Any tables that report GC roots are likely to be different, since object references can be embedded inside struct values that get laid out on the stack. And so on. Some day you might want to compare the machine code for a simple generic Echo<T> method for different kinds of T's; it is really easy to do, and is quite illustrative.
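For the Echo<T> experiment, something as small as the following is enough. EchoDemo and CallAll are made-up names; NoInlining is there only so that each instantiation survives as a separate method you can find in a disassembler or debugger. How you view the generated code is up to your tooling.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class EchoDemo
{
    // Keep the method out of line so each instantiation has its own visible body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Echo<T>(T value)
    {
        return value;
    }

    // Touch a handful of instantiations with different sizes and register classes.
    public static void CallAll()
    {
        short s = Echo<short>(42);                        // 2-byte integer
        double d = Echo<double>(1.5);                     // floating point
        DateTime dt = Echo<DateTime>(DateTime.MinValue);  // 8-byte struct
        string str = Echo<string>("hello");               // reference type
        object obj = Echo<object>(str);                   // another reference type
        GC.KeepAlive(obj);
    }
}
```

Comparing the bodies for short, double, and DateTime against the ones for string and object is where the differences in argument passing and sharing tend to show up.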
A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List and StringList types, and that we diffed the resulting machine code between the distinct list types; we'd presumably find a fair bit of duplication for all the reasons stated above. It would be a nice property if, when we used the generic List<Int16> and List<String> types, and similarly diffed the resulting assembly, the amount of specialized code would be no greater than the amount of specialized code between our best hand-written Int16List and StringList types. I.e., only parts that need to be different are different.
We could go even further with our wish. Imagine I had a List<DateTime> and List<Int64>. Both are 8-byte values, and do not contain any GC references. If I were writing a specialized 8ByteValueList in C++ and had immense performance constraints, I would, again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts, so that I could use the same list type to store any kind of 8-byte value. (Except in C++ I could even store pointers!) It would also be a nice property if generics did some of this for us, while still retaining the type safety we love about generics.
It turns out we will get neither of our wishes exactly, although we will get something close to the spirit of our wishes.
Code sharing
Indeed, the CLR does arrange to share many generic instantiations. The rule is simple, although it is subject to change in the future (being an optimization and all): instantiations over reference types are shared among all reference type instantiations for that generic type/method, whereas instantiations over value types get their own full copy of the code. In other words, List<String> and List<Object> are backed by the same code, but List<DateTime> and List<Int64> get their own.
It is true that, in theory, List<DateTime> and List<Int64> could use the same shared code, because they are of identical size and have GC roots in the same locations (trivially, because neither has one). But there are additional restrictions on generated code that make this problematic, for example if we were talking about Double and Int64. In short, the CLR doesn't actually share value type instantiations as of the 4.0 runtime, although clearly it could in certain situations (value types of the same size with GC roots in the same locations).
As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary<Object, Object> is shared with a Dictionary<String, String>, etc., and a Dictionary<Int64, Object> is shared with a Dictionary<Int64, String>. A Dictionary<DateTime, DateTime> is not, however, shared with a Dictionary<Int64, Int64> instantiation, as per the above.
My pal Joel Pobar wrote a post eons ago describing how code sharing works in great detail, which I do not intend to rehash. Please refer to his post for an excellent overview of how code sharing works.
An important thing to remember, however, is that no matter how much code sharing happens, you still need distinct RTTI data structures. So although List<Object> and List<String> share the same machine code, they have distinct vtables; sure, each table is full of pointers to the same code functions, but you are still paying for the runtime data structures. A distinct instantiation, therefore, is never actually free!
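You can observe the distinct runtime identities directly, whatever the code sharing underneath. A minimal sketch (RttiDemo is a made-up name):

```csharp
using System;
using System.Collections.Generic;

public static class RttiDemo
{
    // The two instantiations may share machine code, but they are different types.
    public static bool SameRuntimeType()
    {
        return typeof(List<object>) == typeof(List<string>);
    }

    // The runtime uses that identity to reject the unsafe conversion.
    public static bool CastSucceeds()
    {
        object strings = new List<string>();
        return (strings as List<object>) != null;
    }
}
```

Both methods return false: the type objects differ and the cast yields null, which is exactly the bookkeeping each instantiation has to pay for.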
Transitive closures
Why am I making such a big deal about code sharing, anyway?
Another surprising aspect of generics is the transitive closure problem. Particularly when doing pre-compilation of generics, each unique instantiation doesn't simply lead to a specialized version of the code associated with the type being directly instantiated. The whole transitive closure of types, starting with that root type, will also be compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play, such that you get one level of explosion at a time, but once there is code that calls a particular type's method, even if that code is lazily compiled, creation of the type is forced.
To illustrate this, let's take our friend List<T>. Before examining the list, how many generic types would you expect that a single new List<T> instantiation instantiates?
What if I told you that a single List<int> instantiation creates (at least) 28 types? And that, say, five unique instantiations of List<T> might cost you 300K of disk space and 70K of working set? Well, of course, if you are writing a script, or something with fairly loose performance requirements, this might not matter much. But if topics like download time, mobile footprint, and cache performance are important to you, then you probably want to pay attention to this. To a first approximation, size is speed.
Yes, you heard me right: 28 types. Holy smokes... How can this be?!
Nested types are one obvious answer, and indeed List<T> has two: an Enumerator type (which is reasonable), and one to support the legacy synchronized collections pattern (which we presumably wish we didn't have to pay for). The larger answer here, however, is functionality. Yes, functionality! This is a great example where the cost of generics explodes as you add more features. Start simple, keep adding stuff, as has happened to List<T> over the years, and you will soon find that a series of elegant abstractions adds up to a gut-wrenching bucket of bytes.
Here's a quick sketch of the transitive closure of generic types used by List<T>:
List<T>
    T[] type
    IList<T> type
        ICollection<T> type
            IEnumerable<T> type
                IEnumerator<T> type
    ReadOnlyCollection<T> type (AsReadOnly)
        (Nothing more than List<T>)
    IComparer<T> type (BinarySearch, Sort)
    {Array.BinarySearch<T> method (BinarySearch)}
        ArraySortHelper<T> type
            IArraySortHelper<T> type
            GenericArraySortHelper<T> type
    EqualityComparer<T> type (Contains)
        IEqualityComparer<T> type
        IEquatable<T> type
        NullableEqualityComparer<T> type
            Nullable<T> type
        EnumEqualityComparer<T> type
            {JitHelpers.UnsafeEnumCast<T> method}
        ObjectEqualityComparer<T> type
    Predicate<T> delegate type (Find*)
    Action<T> delegate type (ForEach)
    {Array.LastIndexOf<T> method (LastIndexOf)}
    Comparison<T> delegate type (Sort)
    Array.FunctorComparer<T> type (Sort)
        Comparer<T> type
            GenericComparer<T> type
            NullableComparer<T> type
            ObjectComparer<T> type
    {Array.Sort<T> method (Sort)}
        ArraySortHelper<T> type (see earlier)
    Enumerator inner type
    SynchronizedList<T> inner type
        IList<T> interface (see earlier)
            ICollection<T> interface (see earlier)
                IEnumerable<T> interface (see earlier)
        {Interlocked.CompareExchange<Object> method (SyncRoot)}
    {_emptyArray T[] static field}
I'm not trying to pick on List<T>. This class is only unique in this regard in that it offers a large transitive closure of (mostly useful!) functionality. And it's not the only guilty party. We recently shaved off 100K's of code size on my team, for example, that were being lost simply because all the LINQ methods were declared as instance methods on the base collection class, rather than being extension methods as in .NET. We found nested enumerator and iterator types, cached static lambdas as static fields, and huge transitive closures of other generic types, all allocated when you just touched any collection type. Any collections library is apt to be full of this stuff, since they are highly generic. But collection libraries are certainly not the only places to go sniffing for such problems.
As an aside, it turns out that extension methods are a great way to make generic abstractions more pay-for-play.
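To sketch what that looks like: keep the generic type itself down to its core state and operations, and hang the optional, closure-expanding functionality off a separate static class. Bag<T>, BagExtensions, and CountWhere are hypothetical names for illustration, and how much you actually save depends on the compiler and runtime.

```csharp
using System;

// Core type: references only T[] and a couple of members.
public sealed class Bag<T>
{
    private T[] _items = new T[4];
    private int _count;

    public int Count { get { return _count; } }

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_count) throw new ArgumentOutOfRangeException("index");
            return _items[index];
        }
    }

    public void Add(T item)
    {
        if (_count == _items.Length) Array.Resize(ref _items, _count * 2);
        _items[_count++] = item;
    }
}

// Optional functionality: Predicate<T> is only pulled in for a given T
// if somebody actually calls CountWhere for that T.
public static class BagExtensions
{
    public static int CountWhere<T>(this Bag<T> bag, Predicate<T> match)
    {
        int n = 0;
        for (int i = 0; i < bag.Count; i++)
        {
            if (match(bag[i])) n++;
        }
        return n;
    }
}
```

Had CountWhere been an instance method, every Bag<T> instantiation would drag Predicate<T> along with it; as an extension method it is a separate generic method instantiation that exists only where it is used.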
Adding it up
Let's see what the above adds up to. I ran some programs through NGen as a quick and dirty experiment, and inspected the on-disk sizes and also the runtime working set sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found. Take these numbers with a grain of salt, as they will change from release to release; they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start trawling the heap with VADump yourself!
One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective of generics… in terms of order of magnitude, you can think "one empty type means 1K of memory.") A single List<S> instantiation, where S is an empty struct, is in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single List<C> instantiation, however, where C is an empty class, is only – surprise – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens that mscorlib.dll already includes an instantiation or two of List<T> over reference types, so this 4K is the incremental cost on top of reusing what is there; remember, unique vtables and data structures are still required for RTTI.
Rico did a similar analysis a few years back, and concluded that each unique List<E>, where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and ever-increasing functionality on the basic collections classes, presumably. Remember, it's not just List<T> that has grown, it's also everything that List<T> uses internally as an implementation detail.
Dynamic specialization with dictionaries
Some specialization in behavior can be accomplished with dynamic runtime behavior, rather than static code specialization. A prime example is the following:
class C
{
    public static void M<T>()
    {
        System.Console.WriteLine(typeof(T).Name);
    }
}
Where does the program get the value of typeof(T) from? If you look at the MSIL, you will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we can compile that as a constant in the code, because it is getting its own copy of the code. What occurs when two instantiations share code, like M<String> and M<Object>, however? As you might guess, there is an indirection.
The thing we usually use for such indirections – the vtable – is nowhere to be found in this particular example, because M is a static method. To deal with this, the compiler inserts an extra "hidden" argument, frequently called a generic dictionary, from which the emitted assembly code can fetch the type token. The cost here typically isn't bad, because many of the operations that pull in the dictionary are already RTTI or Reflection-based, and would require an indirection already (e.g., through a vtable).
The operations which require a dictionary of some kind include anything that has to do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators, etc. And as you might guess, if instantiations aren't shared (such as with value types on the CLR), no dictionary is needed, because the code is fully specialized. There are also multiple kinds of dictionaries used by the runtime, depending on whether you are using a generic type, method, or some combination of both.
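Here is a small sketch of the kind of code that needs such a lookup when the instantiation is shared. TypeQueries and Describe are made-up names. The same source is instantiated over two reference types and one value type, and the answers to typeof(T) and "is T" have to differ even where the machine code may not.

```csharp
using System;

public static class TypeQueries
{
    // For reference-type instantiations the runtime may use one shared body,
    // so typeof(T) and 'is T' must be answered from per-instantiation data.
    public static string Describe<T>(object candidate)
    {
        bool matches = candidate is T;
        return typeof(T).Name + ":" + matches;
    }
}
```

Describe<string>("hi") yields "String:True" and Describe<Uri>("hi") yields "Uri:False", while Describe<int>(5) yields "Int32:True" from code that can have those answers baked in.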
JITting when you didn't mean to
There are two primary ways in which you will JIT compile when using generics, even if you were a good doobie and used NGen to reduce startup time.
One way is if you instantiate a new generic type exported from mscorlib.dll with a type argument also defined in mscorlib.dll, that wasn't already instantiated inside mscorlib.dll. (See my old Generics and Performance blog entry for more details.) You can very easily see this happening by using an instantiation like Dictionary<DateTime, DateTime>, and watching the clrjit.dll module getting loaded.
The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible difficulty for ahead of time separate compilation, because the compiler cannot know statically which slot in the vtable points at the particular implementation you are about to call. (Unless you use whole program compilation, something not offered by .NET at the present time.) For each such method, there's an unbounded set of possible specialized instantiations a slot might point to, and so the vtable cannot be laid out in a traditional manner. C++ doesn't allow templated virtual methods for this very reason.
Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around to find one that is quite front-and-center in .NET: in the implementation of LINQ, there is an Iterator<T> type that has a method declared as follows:
public abstract IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector);
All we need to do is figure out how to tickle that method, and we're guaranteed to JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll to get loaded in .NET 4.0:
int[] xs = …;
int[] ys = xs.Where(x => true).Select(x => x).ToArray();
The Iterator<T> type is used for back-to-back Where and Select operators, as a performance optimization that avoids excess allocations and interface dispatch. But because it depends on a GVM, it does incur an initial penalty for using it, even if you have used NGen to avoid runtime code generation.
In conclusion
The moral of the story here is not that you should fear generics. Beautiful things can be built with them.
Instead, it's to use generics thoughtfully. Nothing in life is free, and generics are no exception to this rule. If code size is important to you, then you will want to have performance gates measuring your numbers against your goals; if you are working in a codebase that uses generics heavily, and you end up spending any significant time on code size optimizations, you will want to try to track down large transitive closures. As I stated above, you could really be throwing away 100K's of code here.
And as to the surprise JITting, I've seen teams compiling with NGen and having a functional gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although tracking down the root cause might be tricky when that gate fails, at least you won't let the camel's nose under the tent.
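A gate like that can be sketched in a few lines. This is only an assumption about how such a check might be written, not a description of any team's actual tooling: the JitGate class and its method names are hypothetical, and the check is only meaningful when run inside an NGen'd build on the desktop CLR.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper; names are invented for this sketch.
public static class JitGate
{
    // Pure helper: true if any module name matches the target, ignoring case.
    public static bool ContainsModule(IEnumerable<string> moduleNames, string target)
    {
        foreach (string name in moduleNames)
            if (string.Equals(name, target, StringComparison.OrdinalIgnoreCase))
                return true;
        return false;
    }

    // Snapshot of the names of the modules loaded into the current process.
    public static List<string> LoadedModuleNames()
    {
        var names = new List<string>();
        using (Process current = Process.GetCurrentProcess())
        {
            foreach (ProcessModule module in current.Modules)
                names.Add(module.ModuleName);
        }
        return names;
    }

    // Call at the end of a test scenario; a true result fails the gate.
    public static bool JitWasLoaded()
    {
        return ContainsModule(LoadedModuleNames(), "clrjit.dll");
    }
}
```

A test harness would run the scenario of interest and then fail if JitGate.JitWasLoaded() returns true. Finding which call caused the load is still a separate investigation.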
Investing in tools here is a very good idea.
When it comes down to it, really thinking about what code must be executed by the process is helpful. Step back and imagine you were writing this all in C++, with the associated performance concerns front-and-center: consider how you'd arrange to reuse as much implementation as you can, manage memory efficiently, perhaps employ unsafe tricks that would have violated type safety and so are off-limits in .NET, and all that jazz. Then step back and be grateful that you have a type- and memory-safe environment to help you write more robust code, but also be realistic about what you are paying in exchange.
I hope you've learned a useful thing or two in this article. If you'd like to learn more, here are a few other good resources:
- An MSR paper on the original implementation of .NET generics: http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf
- Rico's "Six Questions About Generics and Performance" blog entry: http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx
- Joel Pobar's "Generics and Code Sharing" blog post: http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx
- My "Generics and Performance" blog entry: http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx
Cheers.
 Wednesday, June 01, 2011
InfoQ recently asked me a few questions about concurrency and programming languages, and here is what I had to say:
http://www.infoq.com/articles/Interview-Joe-Duffy
A little teaser:
"The major shift we face will be that mainstream languages will start to incorporate more concurrency-safety -- immutability and isolation -- and the platform libraries and architectures will better support this style of software decomposition. OOP developers are accustomed to partitioning their program into classes and objects; now they will need to become accustomed to partitioning their program into asynchronous Actors that can run concurrently. Within this sea of asynchrony will lay ordinary imperative code, frequently augmented with fine-grained task and data parallelism."
As an aside, I know I've been super quiet lately. I never thought I'd go months without blogging. My sincere apologies for this; work has been too all-consuming / fun, and I've been unable to carve out much time for anything else. (Speaking of which, we are still hiring: email me at joedu at you-know-where dot com if you are interested.) I'm about to head out to Europe for a few weeks, where hopefully I'll have a bit of time to write up something exciting to share. Cheers.
 Saturday, December 04, 2010
After spending more time than I’d like to admit over the years researching memory model literature (particularly Java’s terrific JMM and related publications), subsequently trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models, and beating my head against the wall for months developing a new memory model that supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly yet power-efficient way, I have a conclusion to share.
Volatile is evil
Why? Let me recount the reasons:
- It doesn’t mean what you think.
- It used to have a very specific purpose (to ensure that memory operations with external side-effects did not get reordered) and has since been bastardized and used for many secondarily-related purposes.
- Even if you think it does mean what you think, the annotation scheme is all wrong. Volatile annotates a storage location, and yet what really matters is what happens when accessing said storage locations. The fences occur when you access the variable, not when you declare it. And yet from a readability perspective, they are completely invisible and easy to miss.
- And even if you don’t care about readability, the meaning of volatile changes wildly when you switch platforms. Today it’s store / release, tomorrow it’s write / read fences. Perhaps it’s even sequentially consistent. And the label of “store / release” could actually be a white lie, as with the CLR’s memory model thanks to store buffer forwarding and the lack of fences in the CLR JIT’s x64 stores.
- Performance, man, performance! Sure, sequential consistency as the default sounds nice on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles for each write you perform, you’re going to curse volatile like the plague.
And so the moral of the story follows...
Attempting to “fix” volatile is a waste of time
Instead, a new world order has arrived. We must take a two-pronged approach to solving instruction-level interleaving bugs, neither prong having much to do with the traditional definition of memory models or volatiles. We must:
- Eliminate memory ordering from 99% of developers’ purviews. This is already the case with single-threaded programs, because compilers and processors limit code motion to reorderings that only a concurrent observer could detect. So the answer is pretty clear: developers must move towards single-threaded programming models connected through message passing, optionally with provably race-free fine-grained parallelism inside of those single-threaded worlds.
- Leave the memory model esoterica to the Einsteins, and radically change its meaning. Data dependence and transitive visibility of memory operations are in. Volatiles on storage locations are out. Instead, we must throw fences into programmers’ faces, and force them to understand each and every one that occurs. And moreover, force them to decide about each and every one that occurs. Specifically, hidden fences thanks to volatile are no more. Those who cannot take it should fall into the 99% bucket already mentioned above, versus the 1% bucket.
Let’s set #1 aside for now, since it’s obviously a huge can of worms.
But what about #2? It is quite encouraging that the C++0x group is firmly on the path of #2. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html for more details. In a nutshell, each location that you’d have ordinarily tagged volatile instead becomes an instance of the atomic<T> template type. And then each read / write has the opportunity to specify the desired kind of fence, whether that is acquire (for reads), release (for writes), fully ordered, or relaxed (meaning no fence).
I do think it’d be worth them considering compiler-only fences too. So that relaxed means fully-relaxed, and there is a fence in-between that merely prevents the compiler from optimizing the memory operation. This pays homage to volatile’s legacy in C as merely a variable that mustn’t be subject to optimizations, because operations against these variables pertain to, say, memory-mapped device I/O.
Another nitpick of mine is that I’d have required each access to specify the fence, whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s a minor convenience, but I like always having the fence spelled out very explicitly in the code. Lock-free access to shared variables is sufficiently dangerous that automatic sequential consistency is the least of your problems.
Nevertheless, the C++0x direction is a massive step forward, and these are just minor details.
My hope is that .NET follows suit. And the timing couldn’t be better: we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous world, where processor memory models will necessarily continue to weaken. The overly strong x86 memory model, kept alive primarily to ensure compatibility, has simply grown too expensive to accommodate. The power benefits and architectural simplifications are hard to argue with, and because compatibility becomes less of an issue as new platforms arise (e.g. for mobile) and the world moves to the cloud, and hence there is less legacy to worry about, I do hope that processor vendors seize the opportunity. ARM certainly has. It is less about out-of-order execution than it is about coherency costs. Truthfully, I’d be disappointed if anything else happened, even though the risk to compatibility for shrink-wrapped software scares the hell out of me. But this is most certainly the right way to go, long-term. As software platforms move in the direction of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the current .NET memory model is too great and will become a liability with time.
Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like atomic<T>. For example, imagine that we had a new System.Threading.Volatile static class, and that it offered the moral equivalent of atomic<T> as inner types, one for each atomic primitive we can synchronize against:
namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32 {..}
        public struct Int64 {..}
        …
        public struct Reference<T> where T : class {..}
        …
    }
}
Now instead of tagging a location as ‘volatile’, you would use one of these primitives. For example, rather than:
static volatile MySingleton s_instance;
You would say:
static Volatile.Reference<MySingleton> s_instance;
Each struct has a similar set of operations. For example:
namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32
        {
            public Int32(int value);
            public int ReadAcquireFence();
            public int ReadFullFence();
            public int ReadCompilerOnlyFence();
            public int ReadUnfenced();
            public void WriteReleaseFence(int newValue);
            public void WriteFullFence(int newValue);
            public void WriteCompilerOnlyFence(int newValue);
            public void WriteUnfenced(int newValue);
            public int AtomicCompareExchange(int newValue, int comparand);
            public int AtomicExchange(int newValue);
            public int AtomicAdd(int delta);
            public int AtomicIncrement();
            public int AtomicDecrement();
            // Etc…, bitwise ops, other math ops, etc.
        }
    }
}
Of course, only the integer types would offer the increment, decrement, add, and related operators. And it turns out that offering different kinds of fences on the Atomic* operations would be incredibly useful too, because processors like ARM do not couple the fence to the compare-and-swap / load-locked-store-conditional as x86 processors do. Taking advantage of this can be huge if you are writing performance critical code, like a concurrent garbage collector whose atomic swaps need not imply ordering with the surrounding instruction stream. You can quibble over the details, like whether these should use enums instead of the name to encode the fence-kind. I did it this way to keep the implementations branch-free, although with a decent inlining JIT compiler, it’d probably optimize those away thanks to constant propagation.
It’s quite trivial to implement these APIs atop existing .NET primitives. I built a little library that does so, but it was so boring and repetitive I decided not to post it alongside this blog entry as originally intended.
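For the curious, a minimal sketch of the Int32 case might look like the following. It is not the library mentioned above. The name FencedInt32 is hypothetical (chosen to avoid clashing with any framework type), the sketch assumes the Volatile.Read and Volatile.Write helpers that shipped in .NET 4.5, and it omits the compiler-only variants because it assumes no .NET primitive maps to them directly.

```csharp
using System.Threading;

// Hypothetical name; a sketch of the API shape above, not a vetted library.
public struct FencedInt32
{
    private int _value;

    public FencedInt32(int value) { _value = value; }

    // Acquire: later memory operations cannot move before this read.
    public int ReadAcquireFence() { return Volatile.Read(ref _value); }

    // Interlocked operations are full fences; comparing 0 with 0 leaves the value unchanged.
    public int ReadFullFence() { return Interlocked.CompareExchange(ref _value, 0, 0); }

    // Plain read: no ordering guarantee is requested.
    public int ReadUnfenced() { return _value; }

    // Release: earlier memory operations cannot move after this write.
    public void WriteReleaseFence(int newValue) { Volatile.Write(ref _value, newValue); }

    public void WriteFullFence(int newValue) { Interlocked.Exchange(ref _value, newValue); }

    public void WriteUnfenced(int newValue) { _value = newValue; }

    // Returns the value observed before the operation.
    public int AtomicCompareExchange(int newValue, int comparand)
    {
        return Interlocked.CompareExchange(ref _value, newValue, comparand);
    }

    // Returns the previous value.
    public int AtomicExchange(int newValue) { return Interlocked.Exchange(ref _value, newValue); }

    // These three return the updated value, as the Interlocked methods do.
    public int AtomicAdd(int delta) { return Interlocked.Add(ref _value, delta); }
    public int AtomicIncrement() { return Interlocked.Increment(ref _value); }
    public int AtomicDecrement() { return Interlocked.Decrement(ref _value); }
}
```

Because this is a mutable struct, it only behaves as intended when stored in a field or local and used in place. Copying it copies the value and silently breaks the sharing.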
With the above definition, we can very clearly see the fences involved in doing, say, double-checked locking:
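static Volatile.Reference<MySingleton> s_instance;
public static MySingleton Instance
{
    get
    {
        MySingleton instance = s_instance.ReadAcquireFence();
        if (instance == null) {
            MySingleton candidate = new MySingleton();
            // The exchange returns the previous value: null means our candidate was
            // published; non-null means another thread won the race, so use theirs.
            instance = s_instance.AtomicCompareExchangeRelease(candidate, null) ?? candidate;
        }
        return instance;
    }
}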
We see there are two fences. One is an acquire and, depending on what your memory model says about data dependence, is probably unnecessary. Most sane memory models guarantee that a data-dependent load is not reordered ahead of the load it depends on. So we needn’t worry that we’ll see a non-null s_instance whose contents haven’t been initialized. (If we were talking structs, it’d be another story.) Nevertheless, the publication of the object definitely requires a release fence. This guarantees that writes to fields within the MySingleton constructor have completed prior to the write of the new object to the shared instance field. The point here is that you are forced to think about the fences, and you actually see them.
Of course, most platforms need to provide the bare minimum of fencing to assure type safety, particularly for languages like C#. My understanding is that C++0x has decided, at least for now, not to offer type-safety in the face of multithreading. That means you might publish an object and, if stores occur out-of-order, the reader could see an object partially initialized with an invalid vtable pointer. In C# and Java, the language designers have thankfully decided to shield programmers from this. The need for fences also extends to unsafe code like strings, where – were it possible for a thread to read the non-zero length before the char* pointer was valid – writes to random memory could occur and hence threaten type-safety. Thankfully, again, C# and Java protect developers from this, mostly due to the automatic zeroing of memory as the GC allocates and hands out new segments.
There are costs to offering this type safety assurance. So you can understand why the C++ designers want to keep fences out of object allocation. If you have #1 above, however, the costs are dramatically lower and more acceptable. But the world is – unfortunately – still a free-threaded one, and we have several years to go before we’ve reached the final destination. As a step forward, however, the death of volatile is a welcome one. Say it with me.
“Sayonara volatile.”
Here’s hoping that .NET 5.0 takes this step forward too.
 Sunday, October 31, 2010
I rambled on and on a few weeks back about how much performance matters. Although I got a lot of contrary feedback, most of it had more to do with my deliberately controversial title than with the content of the article. I had intended to post a redux, but something more concise is on my mind lately.
GC-based memory management is a boon to productivity, not to mention program safety. Few would argue with this. However, the most effective developers know how their particular GC works, and optimize their program’s data structure, allocation, and lifetime behavior to suit their particular GC best.
This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic fact for a few reasons: it’s probably not likely that the behavior of your favorite GC is going to change too fundamentally over time; if it did, you’d need to rethink things anyhow; oh, and when was that next release anyway (2 or more years out); and finally, what do you care about more, the theoretical loose coupling, or real results today?
One of the worst data structures for traversal is a linked list. That’s because its contents are fetched by pointer-chasing, an act that usually destroys locality, unless the data associated with each pointer was carefully constructed to live next to its previous and next pointer’s data. This seldom happens, because the main point of a linked list is to free you from such constraints.
One of the best data structures for traversal, on the other hand, is an array. Adjacent elements are truly adjacent to one another in memory, meaning that as you fetch the i’th element from memory, you’re probably pulling in the i+1’th, i+2’th, …, and so on, thanks to spatial locality. Of course, if the elements are just pointers, then you're back to the chasing game; as with anything, it depends.
How many elements you prefetch of course depends on the size of the elements with respect to your processor’s cache line. If you’re working with 8-byte elements, and 128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize that cost over the subsequent 15 elements, each found cheaply in cache for 10 cycles. The result is about 250 cycles total (100 + 15 × 10); compare this to a linked list, where you’ll probably spend 1,600 cycles (16 × 100), or more than 6X the cost. And of course you’re trashing other data in the cache in the process. As you traverse more and more of the list, the numbers snowball, and the amortization of locality, or lack thereof, provides a stark contrast.
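The arithmetic above can be written down as a tiny cost model. This is only a back-of-the-envelope sketch: the cycle counts are the illustrative figures from the text rather than measurements, the TraversalCost class is a hypothetical name, and it assumes elements are no larger than a cache line and that a linked list misses on every node.

```csharp
// Hypothetical cost model using the illustrative cycle counts from the text.
public static class TraversalCost
{
    public const int MissCycles = 100; // a fetch that goes to main memory
    public const int HitCycles = 10;   // a fetch satisfied from cache

    // Dense array: one miss per cache line, then hits for the rest of that line.
    public static long ArrayCycles(int elementCount, int elementBytes, int lineBytes)
    {
        int perLine = lineBytes / elementBytes;
        long lines = (elementCount + perLine - 1) / perLine; // round up
        long hits = elementCount - lines;
        return lines * MissCycles + hits * HitCycles;
    }

    // Linked list with no locality: assume every node costs a miss.
    public static long ListCycles(int elementCount)
    {
        return (long)elementCount * MissCycles;
    }
}
```

With 16 eight-byte elements and a 128-byte line, ArrayCycles gives 250 and ListCycles gives 1,600, matching the numbers above.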
There’s another subtle reason why this is important. Stop and think about what happens when a GC occurs.
Yep, that’s right. Your data structures need to be traversed during a GC, after all, to ascertain the liveness of any pointers held within. That scan looks a whole heck of a lot like the same traversal I just described, and enjoys the same locality properties. So we can immediately conclude that data structures whose traversal is efficient will translate into less time spent in the GC chasing pointers, and better cache efficiency.
For programs that are sensitive to long pause times, this is huge. I talk to customers all the time whose programs are sensitive to microsecond-long GC delays, and – aside from ensuring good GC lifetime practices, like ensuring all objects either die young or live long – being conscious about locality can be immensely important. Especially for any long-lived, large data structures that will be subject to Gen2 collections throughout their lifetime.
There is another useful trick to know. If a data structure contains no pointers, the GC has nothing to trace inside it. Obvious, right? A linked list inherently contains pointers, so this trick really doesn’t apply: the GC will need to traverse all live portions of the object graph. What’s interesting is that an array, on the other hand, may or may not contain elements that contain pointers. For example, an array of ints clearly has no pointers, whereas an array of string references clearly does. This doesn’t just apply to the primitive types, but also to custom structs, which may or may not contain references. When the GC encounters a pointer-free array, its contents need not be traversed: the array is alive, and that's that. Yet another opportunity to eliminate pointer chasing. Not only does this save the GC from doing some heavy lifting, but pointer-free structs also eliminate the need for GC write barriers on array stores.
So think of all this next time you’re confronted with the decision to employ a tree, graph, or linked list, and whether a dense, and perhaps pointer-free, representation could be beneficial. Even if it means you must replace pointers with index calculations. The locality benefits may not matter, but then again, they may. And at least you can knowingly make a balanced tradeoff, with these potential advantages in mind.
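As one illustration of replacing pointers with index calculations, here is a minimal sketch of a singly linked list of ints whose links are array indices rather than object references. The IndexLinkedList class is hypothetical and deliberately incomplete (no removal, no enumerator); it only shows the shape of the idea.

```csharp
using System;

// Hypothetical sketch: the list's "pointers" are indices into two int arrays,
// so the backing storage contains no object references for the GC to trace.
public sealed class IndexLinkedList
{
    private const int Nil = -1; // plays the role of a null link
    private int[] _values;
    private int[] _next;
    private int _head = Nil;
    private int _count;

    public IndexLinkedList(int capacity)
    {
        _values = new int[Math.Max(capacity, 1)];
        _next = new int[_values.Length];
    }

    public int Count { get { return _count; } }

    // Pushes onto the front; the new node occupies the next unused slot.
    public void AddFirst(int value)
    {
        if (_count == _values.Length)
        {
            // Grow both arrays together so the indices stay valid.
            Array.Resize(ref _values, _count * 2);
            Array.Resize(ref _next, _count * 2);
        }
        _values[_count] = value;
        _next[_count] = _head;
        _head = _count;
        _count++;
    }

    // Traverses by following index links instead of object references.
    public long Sum()
    {
        long total = 0;
        for (int i = _head; i != Nil; i = _next[i])
            total += _values[i];
        return total;
    }
}
```

The GC sees two int arrays and nothing inside them to chase. The tradeoff is that you now manage slots yourself, and supporting removal would mean adding a free list.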
 Saturday, September 18, 2010
I have several positions open on my team here at Microsoft.
My team's responsibility spans multiple aspects of a research operating system’s programming model. The three main areas are concurrency, languages, and frameworks. When I say concurrency, I mean things like asynchrony and message passing, data and task parallelism, distributed parallelism, runtime scheduling and resource management, and heterogeneity and GPGPU. When I say languages, I mean type systems, mostly-functional programming, verified safe concurrency, and both front- and back-end compilation. And when I say frameworks, I mean virtually anything you could imagine wanting out of a platform framework: all things XML, data interoperability (database, web services, etc.), collections, transactions, multimaster synchronization, and even low level things, like regex, numerics, and globalization.
Our team is 100% developers, and we have an “everybody codes, everybody loves to code” culture. Even managers are expected to spend a significant amount of time prolifically writing code.
All of these components are new and built from the ground up. So self-drive and an ability to have a vision and make it happen are incredibly important.
We are always happy to hire great, hard-working people, regardless of years of experience. If you’re extremely strong in one or more of the abovementioned areas, more of a generalist, are an amazing coder, or all of the above, you’d fit in perfectly. This is the most amazing team of people I’ve ever worked with. If you are interested, please email your resume to me at joedu AT microsoft DOT com.
 Monday, September 06, 2010
I can’t tell you how many times I’ve heard the age-old adage echoed inappropriately and out of context:
"We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil" -- Donald E. Knuth, Structured Programming with go to Statements
I have heard the "premature optimization is the root of all evil" statement used by programmers of varying experience at every stage of the software lifecycle, to defend all sorts of choices, ranging from poor architectures, to gratuitous memory allocations, to inappropriate choices of data structures and algorithms, to complete disregard for variable latency in latency-sensitive situations, among others.
Mostly this quip is used to defend sloppy decision-making, or to justify the indefinite deferral of decision-making. In other words, laziness. It is safe to say that the very mention of this oft-misquoted phrase causes an immediate visceral reaction to commence within me... and it’s not a pleasant one.
In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. Even the best performance architects, when following their intuition, are wrong 9 times out of 10 about what matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.) What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! I am a big believer in statistics, so if a programmer sitting in his or her office writing code thinks just a little bit more about the performance implications of every line of code that is written, he or she will save an entire team that time and then some down the road. Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.
Follow these suggestions and your code will just about always win in both maintainability and performance.
Understand the order of magnitude that matters
First and foremost, you really ought to understand what order of magnitude matters for each line of code you write.
In other words, you need to have a budget; what can you afford, and where can you afford it? The answer here changes dramatically depending on whether you’re writing a device driver, reusable framework library, UI control, highly-connected network application, installation script, etc. No single answer fits all.
I am personally used to writing code where 100 CPU cycles matter. So invoking a function that acquires a lock by way of a shared-memory interlocked instruction that may take 100 cycles is something I am apt to think hard about; even more worrisome is if that acquisition could block waiting for 100,000 cycles. Indeed this situation could become disastrous under load. As you can tell, I write a lot of systems code. If you’re working on a network-intensive application, on the other hand, most of the code you write is going to be impervious to 100 cycle blips, and more sensitive to efficient network utilization, scalability, and end-to-end performance. And if you’re writing a little one-time script, or some testing or debugging program, you may get away with ignoring performance altogether, even multi-million cycle network round-trips.
To be successful at this, you’ll need to know what things cost. If you don’t know what things cost, you’re just flailing in the dark, hoping to get lucky. This includes knowing rule-of-thumb orders of magnitude for primitive operations – e.g. reading / writing a register (nanoseconds, single-digit cycles), a cache hit (nanoseconds, tens of cycles), a cache miss to main memory (nanoseconds, hundreds of cycles), a disk access including page faults (micro- or milliseconds, millions of cycles), and a network roundtrip (milliseconds or seconds, many millions of cycles) – in addition to peering beneath opaque abstractions provided by other programmers, to understand their best, average, and worst case performance.
Clearly the concerns and situations you must work to avoid change quite substantially depending on the class of code you are writing, and whether the main function of your program is delivering a user experience (where usability reigns supreme), delivering server-side throughput, etc. Thinking this through is crucial, because it helps avoid true "premature optimization" traps where a programmer ends up writing complicated and convoluted code to save 10 cycles, when he or she really needs to be thinking about architecting the interaction with the network more thoughtfully to asynchronously overlap round-trips. Understanding how performance impacts the main function of your program drives all else.
Pay attention to interoperability between layers of separately authored software that is composed together. The most common cause of hangs is that an API didn’t specify its expected performance, and so a UI programmer ended up using it in an innocuous but inappropriate way: the UI thread couldn’t afford the order-of-magnitude range of costs that the API could actually incur. Hangs aren’t the only manifestation; O(N^2) performance, or worse, can also result if, for example, a caller didn’t realize that the function being called was going to enumerate a list in order to generate its results.
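Here is a small sketch of how that accidental O(N^2) creeps in. The Dedup class and its method names are hypothetical. The point is only that List<T>.Contains enumerates the list on every call, which is easy to overlook at the call site.

```csharp
using System.Collections.Generic;

// Hypothetical example of an innocuous-looking quadratic loop and its linear fix.
public static class Dedup
{
    // Looks innocent, but List<T>.Contains scans the list, so the loop is O(N^2).
    public static List<int> DistinctQuadratic(IEnumerable<int> items)
    {
        var seen = new List<int>();
        foreach (int item in items)
            if (!seen.Contains(item))
                seen.Add(item);
        return seen;
    }

    // Same result and order; HashSet<T>.Add is O(1) on average, so the loop is O(N).
    public static List<int> DistinctLinear(IEnumerable<int> items)
    {
        var seen = new HashSet<int>();
        var result = new List<int>();
        foreach (int item in items)
            if (seen.Add(item)) // Add returns false when the item is already present
                result.Add(item);
        return result;
    }
}
```

Both methods return the same answer. The difference only shows up once the input grows, which is exactly why nobody notices it in a code review or a small test.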
It is also important to think about worst case situations. What happens if that lock is held for longer than expected, because the system is under load and the scheduler is overloaded? And what if the owning thread was preempted while holding the lock, and now will not get to run again for quite some time? What happens if the network is saturated because a big news event is underway, or worse, the phone network is intermittently cutting out, the network cable has been unplugged, etc.? What about the case where, because a user has launched far too many applications at once, your memory-intensive operation that usually enjoys nice warmth and locality suddenly begins waiting for the disk on the majority of its memory accesses, due to demand paging? These things happen all the time.
In each of these situations, you can end up paying many more orders of magnitude in cost than you expected under ordinary circumstances. The lock acquisition that usually took 100 CPU cycles now takes several million cycles (as long as a network roundtrip), and the network operation that is usually measured in milliseconds is now measured in tens of seconds, as the software painfully waits for the operation to time out. And your "non-blocking" memory-intensive algorithm on the UI thread just caused a hang, because it’s paging like crazy.
You’ve experienced these problems as a user of modern software, I am sure, and it isn’t fun. An hourglass, spinning donut, unresponsive button clicks, "(Not Responding)" title bars, and bleachy white screens. An important measurement of a programmer’s worth is how well the code they write operates under extreme and rare circumstances. Because, truth be told, when you have a large user-base, these circumstances aren’t that rare after all. This is more of a "performance in the large" thing, but it turns out that the end result is delivered as a result of many "performance in the small" decisions adding up. A developer wrote code meant to be used in a particular way, but decided what order of magnitude was reasonable based on best case, … and gave no thought to the worst case.
Using the right data structure for the job
This is motherhood and apple pie, Computer Science 101, … bad clichés abound. And yet so many programmers get this wrong, because they simply don’t give it enough thought.
One of my favorite books on the topic ("Programming Pearls") has this to say about them:
"Most programmers have seen them, and most good programmers realize they’ve written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs."
I’ll add one adjective to the "short, clean, beautiful" list: fast.
Data structures drive storage and access behavior, both strongly affecting the size and speed of algorithms and components that make use of them. Worst case really does matter. This too is a case where the right choice will boost not only performance but also the cleanliness of the program.
I’m actually not going to spend too much time on this; when I said this is CS101, I meant it. However, it is crucial to be intentional and smart in this choice. Validate assumptions, and measure.
Ironically, in my experience, many senior programmers can make frighteningly bad data structure choices, often because they are more apt to choose a sophisticated and yet woefully inappropriate one. They may choose a linked list, for example, because they want zero-allocation element linking via an embedded next pointer. And yet they then end up with many list traversals throughout the program, where a dense array representation would have been well worth the extra allocation. The naïve programmer would have happily new’d up a List<T>, and avoided some common pitfalls; yet, here the senior guy is working as hard as humanly possible to avoid a single extra allocation. They over-optimized in one dimension, and ignored many others that mattered more.
This same class of programmer may choose a very complicated lock-free data structure for sharing elements between threads, incurring many more object allocations (and thus increased GC pressure), and a large number of expensive interlocked operations scattered throughout the code. The sexy lure of lock-freedom tricked them into making a bad choice. Perhaps they didn’t quite understand that locks and lock-free data structures share many costs in common. Or perhaps they just hoped to get lucky and squeeze out out-of-this-world scalability thanks to lock-freedom, without actually considering the access patterns necessary to lead to such scalability and whether their program employed them.
These are often held up as examples of "premature optimization", but I hold them up as examples of "careless optimization". The double kicker here is that the time spent building the more complicated solution would have been better spent carefully thinking and measuring, and ultimately deciding not to be overly clever in the first place. This most often plagues the mid-range programmer, who is just smart enough to know about a vast array of techniques, but not yet mature enough to know when not to employ them.
A different, better-performing approach
It’s an all-too-common occurrence. I’ll give code review feedback, asking "Why didn’t you take approach B? It seems to be just as clear, and yet obviously has superior performance." Again, this is in a circumstance where I believe the difference matters, given the order of magnitude that matters for the code in question. And I’ll get a response, "Premature optimization is the root of all evil." At which point I want to smack this certain someone upside the head, because it’s such a lame answer.
The real answer is that the programmer didn’t stop to carefully consider alternatives before coding up solution A. (To be fair, sometimes good solutions evade the best of us.) The reality is that the alternative approach should have been taken; it may be true that it’s "too late" because the implications of the original decision were nontrivial and perhaps far-reaching, but that is too often an unfortunate consequence of not taking the care and thought to do it right in the first place.
These kinds of "peanut butter" problems add up in a hard-to-identify way. Your performance profiler may not obviously point out the effect of such a bad choice so that it’s staring you in the face. Rather than making one routine 1000% slower, you may have made your entire program 3% slower. Make enough of these sorts of decisions, and you will have dug yourself a hole deep enough to take a considerable percentage of the original development time just digging out. I don’t know about you, but I prefer to clean my house incrementally and regularly rather than letting garbage pile up to the ceilings, with flies buzzing around, before taking out the trash. Simply put, all great software programmers I know are proactive in writing clear, clean, and smart code. They pour love into the code they write.
In this day and age, where mobility and therefore power is king, instructions matter. My boss is fond of saying "the most performant instruction is the one you didn’t have to execute." And it’s true. The best way to save battery power on mobile phones is to execute less code to get the same job done.
To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?
int[] Scale(int[] inputs, int lo, int hi, int c) {
    var results = from x in inputs
                  where (x >= lo) && (x <= hi)
                  select (x * c);
    return results.ToArray();
}
It’s hard to account for them all.
There are two delegate object allocations, one for the call to Enumerable.Where and the other for the call to Enumerable.Select. These delegates point to potentially two distinct closure objects, each of which has captured enclosing variables. These closure objects are instances of new classes, which occupy nontrivial space in both the binary and at runtime. (And of course, the arguments are now stored in two places, must be copied to the closure objects, and then we must incur extra indirections each time we access them.) In all likelihood, the Where and Select operators are going to allocate new IEnumerable and new IEnumerator objects. For each element in the input, the Where operator will make two interface method calls, one to IEnumerator.MoveNext and the other to IEnumerator.get_Current. It will then make a delegate call, which is slightly more expensive than a virtual method call on the CLR. For each element for which the Where delegate returns ‘true’, the Select operator will have likewise made two interface method calls, in addition to another delegate invocation. Oh, and the implementations of these likely use C# iterators, which produce relatively fat code, and are implemented as state machines which will incur more overhead (switch statements, state variables, etc.) than a hand-written implementation.
Wow. And we aren’t even done yet. The ToArray method doesn’t know the size of the output, so it must make many allocations. It’ll start with 4 elements, and then keep doubling and copying elements as necessary. And we may end up with excess storage. If we end up with 33,000 elements, for example, we will waste about 128KB of dynamic storage (32,000 X 4-byte ints).
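The arithmetic behind that estimate can be checked with a short sketch. It assumes the growth policy described above (start at 4 slots, double whenever the buffer is full); `GrowthDemo` and `CapacityFor` are hypothetical names used only for illustration, and the real ToArray implementation may differ in its details.

```csharp
using System;

public static class GrowthDemo
{
    // Capacity reached by a buffer that starts at 4 slots and doubles
    // whenever it runs out of room (the policy assumed in the text).
    public static int CapacityFor(int count)
    {
        int capacity = 4;
        while (capacity < count)
        {
            capacity *= 2;
        }
        return capacity;
    }

    public static void Main()
    {
        int count = 33000;
        int capacity = CapacityFor(count);      // 65,536
        int unusedSlots = capacity - count;     // 32,536
        long unusedBytes = (long)unusedSlots * sizeof(int);
        Console.WriteLine("{0} slots, {1} unused, {2} bytes",
            capacity, unusedSlots, unusedBytes);
    }
}
```

Under these assumptions the unused tail is 32,536 ints, or 130,144 bytes, which is where the "about 128KB" figure comes from.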
A programmer may have written this routine this way because he or she has recently discovered LINQ, or has heard of the benefits of writing code declaratively. And/or he or she may have decided to introduce a more general purpose implementation of a Scale API versus doing something specific to the use in the particular program that Scale will be immediately used in. This is a great example of why premature generalization is often at odds with writing efficient code.
Imagine an alternative universe, on the other hand, where Scale will only get used once and therefore we can take advantage of certain properties of its usage. Namely, perhaps the input array need not be preserved, and instead we can update the elements matching the criteria in place:
void ScaleInPlace(int[] inputs, int lo, int hi, int c) {
    for (int i = 0; i < inputs.Length; i++) {
        if ((inputs[i] >= lo) & (inputs[i] <= hi)) {
            inputs[i] *= c;
        }
    }
}
A quick-and-dirty benchmark shows this to be an order of magnitude faster. Again, is it an order of magnitude that you care about? Perhaps not. See my earlier thoughts on that particular topic. But if you care about costs in the 100s or 1000s of cycles range, you probably want to pay heed.
Now, I’m not trying to take potshots at LINQ. It was just an example. In fact, I spent 3 years running a team that delivered PLINQ, a parallel execution engine for LINQ-to-Objects. LINQ is great where you can afford it, and/or where the alternatives do not offer ridiculously better performance. For example, if you can’t do in-place updates, functionally producing new data is going to require allocations no matter which way you slice it. But having watched people using PLINQ, I have witnessed numerous occasions where an inordinately expensive query was made 8-times faster by parallelizing… where the trivial refactoring into a slimmed down algorithm with proper data structures would have sped the code up by 100-fold. Parallelizing a piggy piece of code to make it faster merely uses more of the machine to get the same job done, and will negatively impact power, resource management, and utilization.
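For the case where the input must be preserved, here is a minimal sketch of a hand-written equivalent of Scale that still returns a new array. `ScaleCopy` is a hypothetical name, and counting first is just one possible design: it trades a second pass over the input for a single, exactly sized allocation, with no delegates, closures, or enumerators.

```csharp
using System;

public static class ScaleDemo
{
    // Hypothetical non-LINQ equivalent of Scale: returns a new array and
    // leaves the input untouched, at the cost of exactly one allocation.
    public static int[] ScaleCopy(int[] inputs, int lo, int hi, int c)
    {
        // Pass 1: count the matches so the result can be sized exactly.
        int count = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            if (inputs[i] >= lo && inputs[i] <= hi) count++;
        }

        // Pass 2: fill the exactly sized result.
        int[] results = new int[count];
        int next = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            int x = inputs[i];
            if (x >= lo && x <= hi) results[next++] = x * c;
        }
        return results;
    }

    public static void Main()
    {
        int[] scaled = ScaleCopy(new[] { 1, 5, 10, 20 }, 5, 10, 3);
        Console.WriteLine(string.Join(", ", scaled)); // 15, 30
    }
}
```

Whether the second pass is cheaper than growing a buffer depends on the data, so this is worth measuring rather than assuming.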
Another view is that writing code in this declarative manner is better, because it’ll just get faster as the compiler and runtimes enjoy new optimizations. This sounds nice, and seems like taking a high road of some sort. But what usually matters is today: how does the code perform using the latest and greatest technology available today. And if you scratch underneath the surface, it turns out that most of these optimizations are what I call "science fiction" and unlikely to happen. If you write redundant asserts at twenty adjacent layers of your program, well, you’re probably going to pay for them. If you allocate objects like they are cheap apples growing on trees, you’re going to pay for them. True, optimizations might make things faster over time, but usually not in the way you expect and usually not by orders of magnitude unless you are lucky.
A colleague of mine used to call C a WYWIWYG language—"what you write is what you get"—wherein each line of code roughly mapped one-to-one, in a self-evident way, with a corresponding handful of assembly instructions. This is a stark contrast to C#, wherein a single line of code can allocate many objects and have an impact to the surrounding code by introducing numerous silent indirections. For this reason alone, understanding what things cost and paying attention to them is admittedly more difficult – and arguably more important – in C# than it was back in the good ole’ C days. ILDASM is your friend … as is the disassembler. Yes, good systems programmers regularly look at the assembly code generated by the .NET JIT. Don’t assume it generates the code you think it does.
Gratuitous memory allocations
I love C#. I really do. I was reading my "Secure Coding in C and C++" book for fun this weekend, and it reminded me how many of those security vulnerabilities are eliminated by construction thanks to type- and memory-safety.
But the one thing I don’t love is how easy and invisible it makes heap allocations.
The very fact that C++ memory management is painful means most C++ programmers are overly-conscious about allocation and footprint. Having to opt-in to using pointers means developers are conscious about indirections, rather than having them everywhere by default. These are qualitative, hard-to-backup statements, but in my experience they are quite true. It’s also cultural.
Because they are so easy, it’s doubly important to be on the lookout for allocations in C#. Each one adds to a hard-to-quantify debt that must be repaid later on when a GC subsequently scans the heap looking for garbage. An API may appear cheap to invoke, but it may have allocated an inordinate amount of memory whose cost is only paid for down the line. This is certainly not "paying it forward."
It’s never been so easy to read a GB worth of data into memory and then haphazardly leave it hanging around for the GC to take care of at some indeterminate point in the future as it is in .NET. Or an entire list of said data. Too many times a .NET program holds onto GBs of working set, when a more memory conscientious approach would have been to devise an incremental loading strategy, employ a denser representation of this data, or some combination of the two. But, hey, memory is plentiful and cheap! And in the worst case, paging to disk is free! Right? Wrong. Think about the worst case.
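As one illustration of an incremental loading strategy, the sketch below totals the numbers in a file one line at a time. `IncrementalLoad` and `SumLines` are hypothetical names; `File.ReadLines` is the lazy counterpart to `File.ReadAllLines`, which materializes every line in a `string[]` before returning.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IncrementalLoad
{
    // Consumes lines one at a time; only the running total stays live.
    public static long SumLines(IEnumerable<string> lines)
    {
        long total = 0;
        foreach (string line in lines)
        {
            total += long.Parse(line);
        }
        return total;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "1", "2", "3" });

            // Streams the file lazily instead of loading it whole.
            Console.WriteLine(SumLines(File.ReadLines(path))); // 6
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Each line still becomes a short-lived string, but the working set stays proportional to one line rather than to the whole file.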
Depending on the size of the objects allocated, how long they remain live, and how many processors are being used, the subsequent GCs necessary to clean up incessant allocations may impact a program in a difficult to predict way. Allocating a bunch of very large objects that live for long enough to make it out of the nursery, but not forever, for instance, is one of the worst evils you can do. This is known as mid-life crisis. You either want really short-lived objects or really long-lived ones. But in any case, it really matters: the LINQ example earlier shows how easy it is to allocate crap without seeing it in the code.
If I could do it all over again, I would make some changes to C#. I would try to keep pointers, and merely not allow you to free them. Indirections would be explicit. The reference type versus value type distinction would not exist; whether something was a reference would work like C++, i.e. you get to decide. Things get tricky when you start allocating things on the stack, because of the lifetime issues, so we’d probably only support stack allocation for a subset of primitive types. (Or we’d employ conservative escape analysis.) Anyway, the point here is to illustrate that in such a world, you’d be more conscious about how data is laid out in memory, encouraging dense representations over sparse and pointer rich data structures silently spreading all over the place. We don't live in this world, so pretend as though we do; each time you see a reference to an object, think "indirection!" to yourself and react as you would in C++ when you see a pointer dereference.
Allocations are not always bad, of course. It’s easy to become paranoid here. You need to understand your memory manager to know for sure. Most GC-based systems, for example, are heavily tuned to make successive small object allocations very fast. So if you’re programming in C#, you’ll get less "bang for your buck" by fusing a large number of contiguous object allocations into a small number of large object allocations that ultimately occupy an equivalent amount of space, particularly if those objects are short-lived. Lots of little garbage tends to be pretty painless, at least relatively.
Variable latency and asynchrony
There aren’t many ways to introduce a multisecond delay into your program at a moment’s notice. But I/O can do just that.
Code with highly variable latency is dangerous, because it can have dramatically different performance characteristics depending on numerous variables, many of which are driven by environmental conditions outside of your program’s control. As such it is immensely important to document where such variable latency can occur, and to program defensively against it happening.
For example, imagine a team of twenty programmers building some desktop application. The team is just large enough that no one person can understand the full details of how the entire system works. So you’ve got to compose many pieces together. (As I mentioned earlier, composition can lead to hard-to-predict performance characteristics.) Programmer Alice is responsible for serving up a list of fonts, and Programmer Bob is consuming that list to paint it on the UI. Does Bob know what it takes to fetch the list of fonts? Probably not. Does Alice know the full set of concerns that Bob must think about to deliver a responsive UI, like incremental repaints, progress reporting, and cancellation? Probably not. So Alice does the best she knows how to do: she hits the cache, when the font cache is fully populated, and falls back to fetching fonts from the printer otherwise. She returns a List<Font> object from her API. Now Bob just calls the API and paints the results on the UI; the call appears to be quite snappy in his testing. Unfortunately, when the cache is not populated, the "quick" cache hit turns into a series of network roundtrips with the printer; that hangs the UI, but only for a few milliseconds. Even worse, when the printer is accidentally offline, perhaps due to a power outage, the UI freezes for twenty seconds, because that’s the hard-coded timeout. Ouch!
This situation happens all the time. It’s one of the most common causes of UI hangs.
If you’re programming in an environment where asynchrony is first class, Alice could have advertised the fact that, under worst-case circumstances, fetching the font list would take some time. If she were programming in .NET, for example, she’d return a Task<List<Font>> rather than a List<Font>. The API would then be self-documenting, and Bob would know that waiting for the task’s result is dangerous business. He knows, of course, that blocking the UI thread often leads to responsiveness problems. So he would instead use the ContinueWith API to rendezvous with the results once they become available. And Bob may now know he needs to go back and work more closely with Alice on this interface: to ensure cancellation is wired up correctly, and to design a richer interface that facilitates incremental painting and progress reporting.
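A minimal sketch of what that interface might look like follows. Everything here is an assumption made for illustration: `FontService` and `GetFontsAsync` are hypothetical names, fonts are plain strings rather than Font objects, and the body is a stand-in for the cache lookup and printer round trips.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FontService
{
    // Alice's API (hypothetical): the Task return type advertises that
    // the answer may take a while to arrive.
    public static Task<List<string>> GetFontsAsync()
    {
        return Task.Run(() =>
        {
            // Stand-in for the cache lookup or the printer round trips.
            return new List<string> { "Consolas", "Georgia" };
        });
    }

    public static void Main()
    {
        // Bob's side: register a continuation instead of blocking.
        Task painted = GetFontsAsync().ContinueWith(t =>
        {
            // The antecedent has completed here, so Result does not block.
            foreach (string font in t.Result)
            {
                Console.WriteLine(font);
            }
        });

        // A console program has to wait somewhere before exiting; a UI
        // thread would return to its message loop instead.
        painted.Wait();
    }
}
```

In a real UI, Bob would likely pass TaskScheduler.FromCurrentSynchronizationContext() to ContinueWith so the painting runs back on the UI thread; that is omitted here because a console program has no such context.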
Variable latency is not just problematic for responsiveness reasons. If I/O is expressed synchronously, a program cannot efficiently overlap many of them. Imagine we must make three network calls as part of completing an operation, and that each one will take 25 milliseconds to complete. If we do synchronous I/O, the whole operation will take at least 75 milliseconds. If we do asynchronous I/O, on the other hand, the operation may take as few as 25 milliseconds to finish. That’s huge.
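The overlap can be sketched as follows. This is a simulation under stated assumptions, not real network code: `CallAsync` is a hypothetical stand-in that uses Task.Delay to model latency, and Task.WhenAll requires .NET 4.5 or later.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class OverlapDemo
{
    // Hypothetical stand-in for one network call; Task.Delay simulates latency.
    public static async Task<int> CallAsync(int id, int latencyMs)
    {
        await Task.Delay(latencyMs);
        return id;
    }

    // Starts all three calls before waiting on any of them, so their
    // latencies overlap instead of adding up.
    public static async Task<int[]> RunOverlappedAsync(int latencyMs)
    {
        Task<int> a = CallAsync(1, latencyMs);
        Task<int> b = CallAsync(2, latencyMs);
        Task<int> c = CallAsync(3, latencyMs);
        return await Task.WhenAll(a, b, c);
    }

    public static void Main()
    {
        Stopwatch clock = Stopwatch.StartNew();
        int[] results = RunOverlappedAsync(25).GetAwaiter().GetResult();
        clock.Stop();
        Console.WriteLine("{0} results in {1} ms",
            results.Length, clock.ElapsedMilliseconds);
    }
}
```

The elapsed time will vary with timer resolution and scheduling, but it should sit much closer to one call's latency than to the sum of all three.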
If I had my druthers, all I/O would be asynchronous. But that’s not where we are today.
The concern is not limited to just I/O, of course. Compute- and memory-bound work can quickly turn into variable latency work, particularly under stressful situations like when an application is paging. So truthfully any abstraction doing "heavy lifting" should offer an asynchronous alternative.
Examples of "bad" optimizations
It is easy to take it too far. Even if you’re shaving off cycles where each-and-every cycle matters, you may be doing the wrong thing. If anything, I hope this article has convinced you to be thoughtful, and to strive to strike a healthy balance between beautiful code and performance.
Anytime the optimization sacrifices maintainability, it is highly suspect. Indeed, many such optimizations are superficial and may not actually improve the resulting code’s performance.
The worst category of optimization is one that can lead to brittle and insecure code.
One example of this is heavy reliance on stack allocation. In C and C++, doing stack allocation of buffers often leads to difficult choices, like fixing the buffer size and writing to it in place. There is perhaps no single technique that, over the years, has led to more buffer overrun-based exploits. Not only that, but stack overflow in Windows programs is quite catastrophic, and increases in likelihood the more stack allocation that a program does. So doing _alloca in C++ or stackalloc in C# is really just playing with fire, particularly for dynamically sized and potentially big allocations.
Another example is using unsafe code in C#. I can’t tell you how many times I’ve seen programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking generated by the CLR JIT compiler. It is true that in some circumstances this can be a win. But it is also true that most programmers who do this never bothered to crack open the resulting assembly to see that the JIT compiler does a fairly decent job at automatic bounds check hoisting. This is an example where the cost of the optimization outweighs the benefits in most circumstances. The cost to pin memory, the risk of heap corruption due to a failure to properly pin memory or an offset error, and the complication in the code, are all just not worth it. Unless you really have actually measured and found the routine to be a problem.
If it smells way too complicated, it probably is.
In conclusion
I’m not saying Knuth didn’t have a good point. He did. But the "premature optimization is the root of all evil" pop-culture and witty statement is not a license to ignore performance altogether. It’s not justification to be sloppy about writing code. Proactive attention to performance is priceless, especially for certain kinds of product- and systems-level software.
My hope is that this article has helped to instill a better sense, or reinforce an existing good sense, for what matters, where it matters, and why it matters when it comes to performance. Before you write a line of code, you really need to have a budget for what you can afford; and, as you write your code, you must know what that code costs, and keep a constant mental tally of how much of that budget has been spent. Don’t exceed the budget, and most certainly do not ignore it and just hope that wishful thinking will save your behind. Building up this debt will cost you down the road, I promise. And ultimately, test-driven development works for performance too; you will at least know right away once you have exceeded your budget.
Think about worst case performance. It’s not the only thing that matters, but particularly when the best and worst case differ by an order of magnitude, you will probably need to think more holistically about the composition of caller and callee while building a larger program out of constituent parts.
And lastly, the productivity and safety gains of managed code, thanks to nice abstractions, type- and memory-safety, and automatic memory management, do not have to come at the expense of performance. Indeed this is a stereotype that performance conscious programmers are in a position to break down. All you need to do is slow down and be thoughtful about each and every line of code you write. Remember, programming is as much engineering as it is an art. Measure, measure, measure; and, of course, be thoughtful, intentional, and responsible in crafting beautiful and performant code.
 Sunday, July 11, 2010
That immutability facilitates increased degrees of concurrency is an oft-cited dictum. But is it true? And either way, why?
My view on this matter may be a controversial one. Immutability is an important foundational tool in the toolkit for building concurrent – in addition to reliable and predictable – software. But it is not the only one that matters. Making all your data immutable isn’t going to instantly lead to a massively scalable program. Natural isolation is also critically important, perhaps more so. And, as it turns out, sometimes mutability is just what the doctor ordered, as with large-scale data parallelism.
Isolation first; immutability second; synchronization last
Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve got multiple concurrent pieces of work running simultaneously (or have a goal of getting there); for discussion’s sake, call them tasks. Take two tasks. The first critical decision has two cases: either these tasks concurrently access overlapping data in shared-memory, or they do not. If they do not, they are isolated, and no precautions associated with racing memory updates are needed. If they do share data, on the other hand, then something else must give. If all concurrently accessed data is immutable, or all functions used to interact with data are pure, then dangerous concurrency hazards are avoided. All is well. If some data is mutable, however, then this is where things get tricky, and higher-level synchronization is needed to make accesses safe. This decision tree is straightforward and clear.
I have listed those four attributes – isolated, immutable and pure, and synchronization – in a very intentional order. Thankfully, this order mirrors the natural top-down hierarchical architecture of most modern object- and component-based programs: we have large containers that communicate through well-defined interfaces, each comprised of layers of such containers, and somewhere towards the leaves, a fair amount of intimate commingling of knowledge regarding data and invariants.
This order also reflects the order of complexity and execution-time costs, from least to most. Isolation is simple, because components depend on each other in loosely-coupled ways, and in fact scales superiorly in a concurrent program because no synchronization is necessary, the “right” data structure may be chosen for the job – immutable or not – and locality is part-and-parcel to the architecture. Immutability at least avoids the morass of synchronization, which can affect programs immensely in complexity, runtime overheads, and write-contention for shared data. It is clear that synchronization is something to avoid at all costs, particularly anything done in an ad-hoc manner like locks.
Making the concurrency
But where did all this concurrency come from, anyway?
It came from two things:
1. The coarse-grained breakdown of a program into isolated pieces.
2. The fine-grained data parallelism.
On #1: Program fragments that are isolated are already half-way down the road to running concurrently as tasks. The second half of this journey, of course, is teaching them to interact with one another asynchronously, most frequently through message-passing or by sticking them into a pipeline. The details of course depend on what programming language you are using. It may be through agents, actors, active objects, COM objects, EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some other reification. Nevertheless the isolation is common to all these.
On #2: Data parallelism, it turns out, often works best with mutable data structures. These structures must be partitionable, of course, so that tasks comprising the data parallel operations may operate with logically isolated chunks of this data safely, even if they are parts of the same physical data structure. So chunks of them are isolated even though they don’t appear to be. This is trivially achievable with many important parallel-friendly data structures like arrays, vectors, and matrices. Capturing this isolation in the type system is of course no small task, though region typing gets close (see UIUC’s Data Parallel Java).
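As a rough sketch of this kind of logical isolation within one physical array, the code below hands each task a disjoint index range of a shared mutable array. `PartitionedScale` and `ScaleAll` are hypothetical names; `Partitioner.Create` and `Parallel.ForEach` are the .NET 4 APIs. Nothing in the type system enforces the isolation here, which is exactly the gap noted above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PartitionedScale
{
    // Multiplies every element in place. Each task receives a disjoint
    // [from, to) index range, so no two tasks touch the same element.
    public static void ScaleAll(int[] data, int c)
    {
        // Partitioner.Create rejects an empty range, so bail out early.
        if (data.Length == 0) return;

        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                data[i] *= c;
            }
        });
    }

    public static void Main()
    {
        int[] data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        ScaleAll(data, 2);
        Console.WriteLine(data[999]); // 1998
    }
}
```

No locks or interlocked operations are needed, because the ranges never overlap and no task reads another task's chunk.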
But you usually don’t want these structures to be immutable, because they can be modified in constant-time and space if they are in their classic simple mutable forms. Programmers doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this quite well. Compare this to a world where we are doing data-parallelism over immutable data structures, where modifications often necessitate allocations or more complicated big-oh times due to clever techniques meant to avoid such allocations, as with persistent immutable data structures. This is likely less ideal. It is true that some data parallel operations are not in-place against mutable data – as with PLINQ – at which point purity, but not immutability, is key. The two are related but not identical: immutability pervades the construction of data structures, whereas purity pervades the construction of functions. But if you can get by with one copy of the data, why not do it? Particularly since most datasets amenable to parallel speedups are quite large.
Immutability: the bricks, not the mortar
Notice that the concurrency did not actually come from immutable data structures in either case, however. So what are they good for?
One obvious use, which has little to do with concurrency, is to enforce characteristics of particular data structures in a program. A translation lookup table may not have been meant to be written to except for initialization time, and using an immutable data structure is a wonderful way to enforce this intent.
What about concurrency? Immutable data structures facilitate sharing data amongst otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary. This is the real payoff.
For example, say we’ve got a document-editor and would like to launch a background task that does spellchecking in parallel. How will the spellchecker concurrently access the document, given that the user may continue editing it simultaneously? Likely we will use an immutable data structure to hold some interesting document state, such as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors use this technique. This is zero-cost snapshot isolation.
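The snapshot idea can be sketched in a few lines. This is not a piece-table: it swaps in `ImmutableList<string>` from System.Collections.Immutable (assumed to be available) as a simpler stand-in, and `Document`, `Snapshot`, and `Append` are hypothetical names. It also assumes all edits come from a single thread, such as the UI thread.

```csharp
using System;
using System.Collections.Immutable;

public sealed class Document
{
    // The only mutable state is this reference; every list it has ever
    // pointed to is immutable and therefore safe to share.
    private volatile ImmutableList<string> lines = ImmutableList<string>.Empty;

    // Zero-copy snapshot: hands out the current immutable list.
    public ImmutableList<string> Snapshot()
    {
        return lines;
    }

    // Assumes a single editing thread; concurrent writers would need more care.
    public void Append(string line)
    {
        lines = lines.Add(line);
    }
}

public static class SnapshotDemo
{
    public static void Main()
    {
        Document doc = new Document();
        doc.Append("teh quick");

        // The spellchecker would work from this snapshot...
        ImmutableList<string> snapshot = doc.Snapshot();

        // ...while the user keeps editing.
        doc.Append("brown fox");

        Console.WriteLine(snapshot.Count);       // 1
        Console.WriteLine(doc.Snapshot().Count); // 2
    }
}
```

Taking the snapshot copies nothing and takes no lock; later edits produce new lists and leave the snapshot untouched.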
Not having immutability in this particular scenario is immensely painful. Isolation won’t work very well. You could model the document as a task, and require the spellchecker to interact with it using messages. Chattiness would be a concern. And, worse, the spellchecker’s messages may now interleave with other messages, like a user editing the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization won’t work well either. Clearly we don’t want to lock the user out of editing his or her document just because spellchecking is occurring. Such a boneheaded design is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title bars. Nor do we want to acquire a lock and then make a full copy of the entire document. Perhaps we’d try to copy just what is visible on the screen. This is a dangerous game to play.
Immutability does not solve all of the problems in this scenario, however. Snapshots of any kind lead to a subtle issue that is familiar to those with experience doing multimaster, in which multiple parties have conflicting views on what “the” data ought to be, and in which these views must be reconciled.
In this particular case, the spellchecker sends the results back to the task which spawned it, and presumably owns the document, when it has finished checking some portion of the document. Because the spellchecker was working with an immutable snapshot, however, its answer may now be out-of-date. We have turned the need to deal with message-level interleaving – as described above – into the need to deal with all of the messages that may have interleaved within a window of time. This is where multimaster techniques, such as diffing and merging, come into play. Other techniques can be used, of course, like cancelling and ignoring out-of-date results. But it is clear something intentional must be done.
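The simplest of these options, ignoring out-of-date results, might be sketched with a version stamp. This is one possible scheme under assumptions of my own, not something the scenario above prescribes: `VersionedDocument` and `TryApply` are hypothetical names, and all calls are assumed to happen on the task that owns the document.

```csharp
using System;

public sealed class VersionedDocument
{
    public int Version { get; private set; }
    public string Text { get; private set; }

    public VersionedDocument(string text)
    {
        Text = text;
    }

    // Every edit bumps the version stamp.
    public void Edit(string newText)
    {
        Text = newText;
        Version++;
    }

    // Accepts a result only if it was computed against the current version.
    public bool TryApply(int snapshotVersion, Action apply)
    {
        if (snapshotVersion != Version)
        {
            return false; // stale: drop it (a richer design might merge instead)
        }
        apply();
        return true;
    }
}

public static class StaleResultDemo
{
    public static void Main()
    {
        VersionedDocument doc = new VersionedDocument("teh cat");
        int checkedVersion = doc.Version; // the spellchecker's snapshot

        doc.Edit("the cat sat");          // the user edits meanwhile

        Console.WriteLine(doc.TryApply(checkedVersion, () => { })); // False
        Console.WriteLine(doc.TryApply(doc.Version, () => { }));    // True
    }
}
```

Dropping stale results is cheap but wasteful when edits are frequent, which is where diffing and merging start to earn their complexity.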
In conclusion
It is safe to say that immutability facilitates important concurrent architectures and algorithms. It can really help big time, for sure. But it is clearly no panacea. Whether mutability or immutability is the right choice for a particular data structure in your program, as with all things, depends.
It could be the case that choosing a piece-table for storing your text facilitates large-scales of concurrency in version two of your software application, but that in version one you have no use for it. Making that call ahead of time may pay in spades down the road, even if it comes at a marginal cost up-front. Or it could be that choosing an immutable data structure costs you in time and space, and you never end up exploiting the fact that you could have shared that particular structure in a zero-cost way across agents in your program.
One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure, and Scala, where I’ve got a choice.
 Thursday, July 01, 2010
In .NET today, readonly/initonly-ness is in the eye of the provider. Not the beholder.
Although both C# and the CLR verifier go to great pains to ensure you don't change a readonly/initonly field outside of its constructor (or class constructor, in the case of a static field), this guarantee doesn't imply what you might imagine. It means what it says: you can't change such fields except in certain contexts.
If you try, C# won't let you, including forming byrefs to them:
v.cs(0,0): error CS0191: A readonly field cannot be assigned to (except in a constructor or a variable initializer)
v.cs(0,0): error CS0192: A readonly field cannot be passed ref or out (except in a constructor)
v.cs(0,0): error CS0198: A static readonly field cannot be assigned to (except in a static constructor or a variable initializer)
v.cs(0,0): error CS0199: A static readonly field cannot be passed ref or out (except in a static constructor)
And neither will the CLR verifier:
[IL]: Error: [c:\v.exe : C::Main][offset 0x00000001] Cannot change initonly
field outside its .ctor.
Of course, attempting to invoke an operation on a readonly struct will make a defensive copy locally, and invoke the method against that. This ensures the readonly contents cannot change.
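The defensive copy is easy to observe. In this small sketch (all type names are hypothetical), the same mutating method is called on a readonly field and on an ordinary field of the same struct type:

```csharp
using System;

public struct Counter
{
    public int Count;

    public void Increment()
    {
        Count++;
    }
}

public sealed class Holder
{
    public readonly Counter Fixed; // readonly: method calls act on a copy
    public Counter Free;           // mutable: method calls act in place

    public void BumpBoth()
    {
        Fixed.Increment(); // mutates a temporary copy, which is then discarded
        Free.Increment();  // mutates the field itself
    }
}

public static class DefensiveCopyDemo
{
    public static void Main()
    {
        Holder h = new Holder();
        h.BumpBoth();
        Console.WriteLine(h.Fixed.Count); // 0
        Console.WriteLine(h.Free.Count);  // 1
    }
}
```

The call on the readonly field compiles without complaint and silently has no effect, which is the copy doing its job.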
One unfortunate hole in this safety is with unions. You do not need unsafe code to break readonly, and yet the effect is the same as with an unverifiable program that writes to a readonly field:
using System.Runtime.InteropServices;
struct S1 {
    public readonly int X;
}
struct S2 {
    public int X;
}
[StructLayout(LayoutKind.Explicit)]
struct S3 {
    [FieldOffset(0)]
    public S1 A;
    [FieldOffset(0)]
    public S2 B;
}
Now we can change A.X via B.X, even though A.X is supposedly readonly:
S3 s3 = ...;
int x = s3.A.X;
s3.B.X++;
ASSERT(x == s3.A.X); // false; it is +1
The same would have been true even if the field S3.A was marked readonly.
This is quite an evil trick. I have to be honest that I believe this is a CIL verification hole, and should produce unverifiable MSIL much like when you try to overlay structs containing overlapping GC references. Nevertheless, it is what it is.
Let's step back. Why does all of this matter, anyway, and what guarantees were we hoping that readonly would provide?
It would be ideal, I assert, if the guarantee was not just "the target field can only be written to in the constructor", but also "the target field, once read, cannot be observed with a different value later on". This would not be true during construction, but we'd like to say it holds at all other times.
The above example throws a wrench in this idea. As does the following example. But this new example will be more disturbing, because the solution is not a simple verifier change.
What would you expect this program to print to the console?
struct S {
    public readonly int X;
    public S(int x) { X = x; }
    public void MultiplyInto(int c, out S target) {
        System.Console.WriteLine(this.X);
        target = new S(X * c);
        System.Console.WriteLine(this.X); // same? it is, after all, readonly.
    }
}
S s = new S(42);
s.MultiplyInto(10, out s);
As you may or may not have guessed, the output is "42" followed by "420". Yes, the value of 'this.X' changes after we have assigned to 'target' inside MultiplyInto, because the caller aliases the out-param with the 'this' param. Recall that the 'this' parameter of a struct's instance method is passed byref in C#, just as an out-param is, so that these two references actually physically point to the same location when that call is made. The assignment to 'target', therefore, actually replaces the entire contents of 'this' all at once. And hence this gives the illusion that readonly fields are shifting.
You might be tempted to say that this can be prevented with alias analysis. But this is deceptively difficult to do. Consider this more complicated example:
class C {
    public S S;
}
void M1(C c) {
    M2(c, out c.S);
}
void M2(C c, out S s) {
    c.S.MultiplyInto(10, out s);
}
It is in no way clear inside M2 that the two aliases refer to the same location. The aliasing occurred higher up in the stack. Although byrefs are restricted to stack-only passing, making the necessary alias analysis tantalizingly close to attainable, it is nontrivial to say the least. Presumably we would have had to have blocked the forming of the byref within M1, rather than its use within M2. We could fall back to runtime checks, but that is also unfortunate for numerous reasons.
The moral of the story? Structs as containers of readonly values are not to be trusted, at least not for situations that call for bulletproof safety, such as a compiler caching values rather than rereading them on the grounds that the fields are readonly. Although C# and the CLR do a good job of verifying that readonly/initonly are done right at the initialization site, there are still places where these guarantees break down. Thankfully the byref aliasing problem does not threaten thread-safety, but the union problem does. In conclusion, I have to imagine all of this will get fixed somewhere down the road; it's just a matter of when and where.
 Sunday, June 27, 2010
Partially-constructed objects are a constant source of difficulty in object-oriented systems. These are objects whose construction has not completed in its entirety prior to another piece of code trying to use them. Such uses are often error-prone, because the object in question is likely not in a consistent state. Because this situation is comparatively rare in the wild, however, most people (safely) ignore or remain ignorant of the problem. Until one day they get bitten.
Not only are partially-constructed objects a source of consternation for everyday programmers, they are also a challenge for language designers wanting to provide guarantees around invariants, immutability and concurrency-safety, and non-nullability. We shall see examples below why this is true. The world would be better off if partially-constructed objects did not exist. Thankfully there is some interesting prior art that moves us in this direction from which to learn.
Seeing such a beast in the wild
In what situations might you see a partially-constructed object? There are two common ones in C++ and C#:
- ‘this’ is leaked out of a constructor to some code that assumes the object has been initialized.
- A failure partway through an object’s construction leads to its destructor or finalizer running against a partially-constructed object.
In the first case, the rule of thumb is “don’t do that.” This is easier said than done. The second case, on the other hand, is a fact of life, and the rule of thumb is “tread with care, and be intentional.” Let’s examine both more closely.
The evils of leaking ‘this’
Leaking ‘this’ during construction to code that expects to see a fully-initialized object is a terrible practice. Before moving on, it’s important to remember initialization order in C++ and C#: base constructors run first, and then more derived constructors. If I have E subclasses D subclasses C, then constructing an instance of E will run C’s constructor, and then D’s, and then lastly E’s. Destructors in C++, of course, run in the reverse order.
Member initializers, on the other hand, run in different orders in C++ versus C#. In C#, they run from most derived first, to least derived. So in the above situation, E’s initializers run, and then D’s, and then C’s. This happens fully before running the ad-hoc constructor code. In C++, however, member initializers run alongside the ordinary construction process. C’s member initializers run just before C’s ad-hoc construction code, and then D’s, and then E’s. Another difference is that C#’s initializers cannot access ‘this’, whereas C++’s initializers can.
For example, this C# program will print E_init, D_init, C_init, C_ctor, D_ctor, and then E_ctor:
using System;
class C {
int x = M();
public C() {
Console.WriteLine("C_ctor");
}
private static int M() {
Console.WriteLine("C_init");
return 42;
}
}
class D : C {
int x = M();
public D() : base() {
Console.WriteLine("D_ctor");
}
private static int M() {
Console.WriteLine("D_init");
return 42;
}
}
class E : D {
int x = M();
public E() : base() {
Console.WriteLine("E_ctor");
}
private static int M() {
Console.WriteLine("E_init");
return 42;
}
}
class Program {
public static void Main() {
new E();
}
}
And this C++ program will print C_init, C_ctor, D_init, D_ctor, E_init, E_ctor, ~E, ~D, and finally ~C:
#include <iostream>
using namespace std;
struct C {
int x;
C() : x(M()) { cout << "C_ctor" << endl; }
~C() { cout << "~C" << endl; }
static int M() { cout << "C_init" << endl; return 42; }
};
struct D : C {
int x;
D(): x(M()) { cout << "D_ctor" << endl; }
~D() { cout << "~D" << endl; }
static int M() { cout << "D_init" << endl; return 42; }
};
struct E : D {
int x;
E() : x(M()) { cout << "E_ctor" << endl; }
~E() { cout << "~E" << endl; }
static int M() { cout << "E_init" << endl; return 42; }
};
int main() {
E e;
}
It’s interesting to note that the CLR allows constructor chaining to happen in any order. The C# compiler emits the calls to base as the first thing a constructor does, but other languages can choose to do differently. The verifier ensures that a call has occurred somewhere in the constructor body before returning.
This IL program, for example, will print E, D, and then C – the reverse of what C# gives us:
.assembly extern mscorlib { }
.assembly ctor { }
.class C {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "C"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void [mscorlib]System.Object::.ctor()
ret
}
}
.class D extends C {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "D"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void C::.ctor()
ret
}
}
.class E extends D {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "E"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void D::.ctor()
ret
}
}
.class Program {
.method public static void Main() cil managed {
.entrypoint
newobj instance void E::.ctor()
pop
ret
}
}
So why is leaking ‘this’ bad, anyway?
Say you’ve decided in the implementation of D’s constructor that you would like to stick ‘this’ into a global hash-map. Doing this sadly means other code could grab the pointer and begin accessing it before E’s constructor has even run. That is a race at-best and a ticking time-bomb in all likelihood. For example:
class C {
public static Dictionary<int, C> s_globalLookup = new Dictionary<int, C>();
private readonly int m_key;
public C(int key) {
m_key = key;
s_globalLookup.Add(key, this);
}
}
Even though we have taken great care to initialize our readonly field m_key before sticking ‘this’ into a dictionary, any subclasses will not have been initialized at this point. Only if C is sealed can we be assured of this. Another piece of code that grabs the element out of the hashtable and begins calling virtual methods on it, for example, is in a race with the completion of the initialization code for subclasses.
Leaking ‘this’ isn’t always such an explicit act. Merely invoking a virtual method within the constructor may dispatch a more derived class’s override before the more derived class’s constructor has run. And therefore its state is most likely not in a place conducive to correct execution of that override. It is fairly common knowledge that invoking virtual methods during construction is an extraordinarily poor practice, and best avoided.
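C++ behaves differently from C# here: while a base-class constructor runs, a virtual call resolves to the base class's own version rather than to the override. The leaked pointer is still observed in a not-yet-derived state, though. A minimal sketch, in which Base, Derived, observations and leak_demo are illustrative names of my own:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative global log of what an outside observer saw.
inline std::vector<std::string>& observations() {
    static std::vector<std::string> log;
    return log;
}

struct Base {
    Base() {
        // 'this' escapes here, before any derived constructor has run.
        observations().push_back(observe(*this));
    }
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }

    // Stands in for unrelated code that got hold of the leaked object.
    static std::string observe(const Base& b) { return b.name(); }
};

struct Derived : Base {
    std::string label = "Derived";
    std::string name() const override { return label; }
};

// Returns what was seen during construction, then what is seen afterwards.
inline std::vector<std::string> leak_demo() {
    observations().clear();
    Derived d;
    observations().push_back(Base::observe(d));
    return observations();
}
```

The observer sees "Base" while construction is in flight and "Derived" once it has finished, so the same object answers the same question differently depending on when it is asked.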
Failure to construct
Let’s move on to the second issue. Imagine we suffer an exception during construction of an object. Perhaps this is due to a failure to allocate resources, or maybe even due to argument validation. It is clear that a leaked ‘this’ object in such cases will be a problem, because the object will have escaped into the wild even though its initialization failed. Subsequent attempts to use the object will obviously pose problems. What is more subtle is that if the class in question declares a destructor (in C++) or finalizer (in C#), a problem may now be lurking.
Let’s say we have the original situation shown above: E derives from D derives from C. Now say an exception happens within D’s constructor. At this point in time, C’s constructor has run to completion, D’s constructor has run partially up to the point of failure, and E’s has not run at all. (And, of course, in the case of C#, all member-initializers for all classes have actually run.) What happens to the cleanup code?
In C++, only constructors that have run will have their corresponding destructors executed. In the above situation, where C, D, and E each declares a destructor, only C’s will be run during stack unwind. It is imperative, therefore, that D handles failure within its constructor rather than relying on the destructor. For example:
class D : C {
int* m_pBuf1;
int* m_pBuf2;
public:
D() {
m_pBuf1 = … alloc …;
m_pBuf2 = … alloc …;
}
~D() {
if (m_pBuf2) … free …;
if (m_pBuf1) … free …;
}
}
If the allocation destined for m_pBuf2 fails by throwing an exception, the destructor for D will not run, and therefore m_pBuf1 will leak. The C++ solution to this particular example is to use smart pointers and member initializers for the allocations, because successfully initialized members do get destructed, even when the constructor (or indeed one of the member initializers) subsequently fails. This means that destructors for a particular class do not have to tolerate that class’s state not having been fully constructed, because those destructors will never run. Destructors must not, however, invoke virtuals, for two obvious reasons: (a) subclasses may not have ever been initialized, and (b) destructors run in reverse order, and so the destruction code for the subclass will have already run by the time a baseclass’s destructor runs.
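To make the member-initializer behavior concrete, here is a small sketch that records the order of events when the second of two members fails to construct. Buffer, Holder and events are hypothetical stand-ins of mine; a real program would use smart pointers in place of Buffer:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Records events so the order can be inspected afterwards.
inline std::vector<std::string>& events() {
    static std::vector<std::string> log;
    return log;
}

// Stand-in for an owned resource.
struct Buffer {
    std::string name;
    Buffer(std::string n, bool fail) : name(std::move(n)) {
        if (fail) throw std::runtime_error("alloc failed: " + name);
        events().push_back("acquire " + name);
    }
    ~Buffer() { events().push_back("release " + name); }
};

struct Holder {
    Buffer buf1;
    Buffer buf2;
    // buf1 is initialized successfully; buf2 throws during initialization.
    Holder() : buf1("buf1", false), buf2("buf2", true) {
        events().push_back("Holder ctor body");
    }
    ~Holder() { events().push_back("~Holder"); }
};

inline std::vector<std::string> failed_construction_events() {
    events().clear();
    try {
        Holder h;
    } catch (const std::runtime_error&) {
        events().push_back("caught");
    }
    return events();
}
```

The log contains "acquire buf1", "release buf1" and "caught", in that order. Neither the constructor body nor ~Holder ever runs, yet the successfully initialized member is still cleaned up.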
In C#, finalizers will run, regardless of whether an object’s constructor ran fully, partway through, or not at all. If the object is allocated – and so long as GC.SuppressFinalize isn’t called on it – the finalizer runs. This distinctly means that C# finalizers must always be resilient to partially-constructed objects (unlike C++ destructors). Thankfully finalizers are a rare bird, and therefore this issue is seldom even noticed by .NET programmers.
This issue does not arise in the case of .NET’s IDisposable pattern. If a constructor throws, the assignment to the target local variable does not occur. And therefore the variable enclosed in, say, a using statement remains null. This means that there is no way to possibly invoke Dispose on the object. Moreover, the allocation in using occurs prior to entering the try/finally block. Hence, you really had better be writing constructors that don’t throw and/or protecting such resources with smart-pointer-like things with finalizers, a la SafeHandle.
Impediments to language support
As if these weren’t sufficient cause for concern, I also mentioned earlier – and somewhat vaguely – that partially-constructed objects interfere with language designers’ efforts to add invariants, immutability and concurrency-safety, and non-nullability to the language. And all of these are important properties to consider in our present age of complexity and concurrency, so this point is worth understanding more deeply. Let’s look at each in-order.
Invariants and safe-points
A partially-constructed object obviously may have broken invariants. By definition, invariants are meant to hold at the end of construction, and so if construction never completes, the rules of engagement are being broken.
Imagine, for example:
class C {
int m_x;
int m_y;
invariant m_x < m_y;
public C(int a) {
m_x = a;
m_y = a + 1;
}
}
It is ordinarily very difficult to ensure that each instruction atomically transitions the state of an object from one invariant safe-point to another. A common technique is to define well-defined points at which invariants must hold. We might model each failure point as one such point. But even in C#, the above program does not satisfy this constraint, because an overflow exception may be generated at the ‘m_y = a + 1’ line (when overflow checking is turned on). Or a thread-abort exception may be generated right in between those two assignments. Or, if addition were implemented as a JIT helper, an out-of-memory exception could get generated due to failure to JIT the helper function.
In such cases, we’d like to say that the object does not exist. But the sad fact is that the object *does* exist, and if the object has acquired physical resources at the time of failure to construct, we must compensate and reclaim them. The ideal world looks a lot like object construction as transactions, where the end of construction is the commit-point. The state-of-the-art is very different from this, however, and so any static verification and theorem proving that depends on invariants on an object holding, well, invariantly, is subject to being broken by partially-constructed objects.
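One way to approximate the "construction as a transaction" idea by hand is to do all of the fallible work before the object exists, so that the constructor itself only copies already-validated values. A sketch in C++, assuming a hypothetical Range type with a static create factory (both names are mine):

```cpp
#include <cassert>
#include <climits>
#include <stdexcept>

// Invariant: x() < y().
class Range {
    int x_;
    int y_;
    // Only copies validated values, so it cannot fail partway through.
    Range(int x, int y) noexcept : x_(x), y_(y) {}
public:
    // All fallible work happens here, before any Range object exists.
    static Range create(int a) {
        if (a == INT_MAX) throw std::overflow_error("a + 1 would overflow");
        return Range(a, a + 1);  // the "commit point"
    }
    int x() const { return x_; }
    int y() const { return y_; }
};

// Illustrative helper: reports whether creation commits or aborts.
inline bool creation_succeeds(int a) {
    try {
        Range::create(a);
        return true;
    } catch (const std::overflow_error&) {
        return false;
    }
}
```

This is only a convention, of course. Nothing in the language stops somebody from adding fallible code to the constructor later, which is exactly the gap the paragraph above laments.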
Immutability… or not
Immutability is also threatened by partially-constructed objects. Immutability is one of many first class techniques for solving concurrency-safety in the language, so this one is quite unfortunate.
In C#, for example, we might be tempted to say that a shallowly immutable type is one whose fields are all marked readonly. (And a deeply immutable type is one whose fields are all readonly, and also refer to types that are immutable.) A readonly field cannot be written to after construction has completed. Unfortunately, if the ‘this’ leaks out during construction, we may see those readonly fields in an uninitialized or even changing state:
class C {
public static C s_c;
readonly int m_x;
public C() {
m_x = 42;
s_c = this;
while (true) {
++m_x;
}
}
}
This is quite evidently a terrible and malicious program. C appears to be immutable, because it only contains readonly fields, but is quite clearly not, because the value of m_x is assigned to multiple times. If we had a guarantee that all readonly fields were definitely assigned once-and-only-once before ‘this’ can escape, then we’d be close to a solution. But of course we have no such guarantee. In C#, at least.
A related issue is co-initialization of objects. This is interesting and relevant, because in such cases we actually want to leak out partially-constructed objects. Imagine we’d like to build a cyclic graph composed of two nodes, A and B, each referring to the other. With a naïve approach to immutability, we simply cannot make this work. Either A must first refer to B, in which case A refers to a partially-constructed B; or B must first refer to A, in which case B refers to a partially-constructed A. Both are equally bad. The two assignments are not atomic.
Cyclic data structures are a commonly cited weakness of immutability, and an argument in favor of supporting partially-instantiated objects in a first class way, although there are approaches that can work. One example is to separate edges from nodes, and represent them with different data structures. We can then build the nodes A and B, and then build the edges A->B and B->A without needing to use cycles.
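The nodes-and-edges separation can be sketched as follows. This is only an illustration under my own naming (Node, Edge, Graph, make_cycle); edges refer to nodes by index, so no node ever needs to point at another node:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Nodes hold no references to other nodes, so each one is complete once built.
struct Node {
    std::string name;
};

// An edge refers to its endpoints by index into the node list.
struct Edge {
    std::size_t from;
    std::size_t to;
};

// Assembled from finished parts and never mutated afterwards.
class Graph {
    const std::vector<Node> nodes_;
    const std::vector<Edge> edges_;
public:
    Graph(std::vector<Node> nodes, std::vector<Edge> edges)
        : nodes_(std::move(nodes)), edges_(std::move(edges)) {}

    // Names of the nodes reachable from node i in one step.
    std::vector<std::string> successors(std::size_t i) const {
        std::vector<std::string> out;
        for (const Edge& e : edges_)
            if (e.from == i) out.push_back(nodes_[e.to].name);
        return out;
    }
};

// The cycle A -> B -> A, built without any partially-constructed node.
inline Graph make_cycle() {
    return Graph({Node{"A"}, Node{"B"}}, {Edge{0, 1}, Edge{1, 0}});
}
```

The cost is an extra level of indirection on every traversal, which is the usual price of this representation.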
It’s not-null, or at least it wasn’t supposed to be
Tony Hoare called it his billion-dollar mistake: the introduction of nulls into a programming language. I think he sells himself short, however, because the absence of nulls in an imperative programming language – however worthy a pursuit – is actually notoriously difficult to attain.
Spec# is one example of a C-style language with non-nullability, in which T! represents a “non-null T”, for any T. Although this is done in a pleasant way conducive to C#-compatibility – a significant goal of Spec# – I’d personally prefer to see the polarity switched: T would mean “non-null T” and T? would mean “nullable T”, for any T, reference- and value-types included. This is much more like Haskell’s Maybe monad, and is the syntax I’ll use below for illustration purposes. But I digress.
Non-nullability is a wonderful invention, because it is common for 75% or more of the contracts and assertions in modern programs to be about a pointer being non-null prior to dereferencing it, both in C# and in C++. Relying on the type-system to prove the absence of nulls is one big step towards creating programs that are robust and correct out-of-the-gate, particularly for systems software where such degrees of reliability are important. And it cuts down on all those boilerplate contracts sprinkled throughout a program. Instead of:
void M(C c, D d, E e)
requires c != null, d != null, e != null
{
… use(c, d, e) …
}
You simply say:
void M(C c, D d, E e)
{
… use (c, d, e) …
}
No opportunity to miss one, and no need for runtime checks. It’s beautiful.
A problem quickly arises, however, having to do with partially-constructed objects. All of an object’s fields cannot possibly be non-null while the constructor is executing, because the object has been zero’d out and the assignments to its fields have not yet been made. Clearly constructor code needs to be treated “differently” somehow. We cannot simply live with the fact that ‘this’ escaping leads to a partially-constructed object leaking out into the program, because that could lead to serious errors. These serious errors include potential security holes, if unsafe code is manipulating the supposedly non-null pointer. One advantage to adding non-nullability is that runtime null checks can be omitted, because the type system guarantees the absence of nulls in certain places. In this situation, partially-constructed objects lead to holes in our nice type system support. Either the dynamic non-null checks are required as back-stop, or we’ll need some other coping technique.
There are related issues with non-nullability, like with partially-populated arrays. Imagine we’d like to allocate an array of 1M elements of type T, and we will proceed to populate those elements following the array’s allocation. There’s clearly a window of time during which the array contains 1M nulls, and then 1M-1 nulls, …, and, if we finish, 1M-1M nulls, i.e. 0 nulls. It is only at that last transition that the array can be considered to contain non-null T’s. The standard technique is to use an explicit dynamic conversion check, or to force the creator of the array to supply all of the elements at construction time.
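Both techniques can be combined in a small sketch: a wrapper that performs the dynamic conversion check once, and an array factory that asks the caller for every element up front. NonNull, make_non_null_array and the helpers below are hypothetical names of mine, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Holds a pointer that was checked for null once, at conversion time.
template <typename T>
class NonNull {
    std::shared_ptr<T> p_;
public:
    explicit NonNull(std::shared_ptr<T> p) : p_(std::move(p)) {
        if (!p_) throw std::invalid_argument("null where non-null is required");
    }
    T& operator*() const { return *p_; }  // no null check needed here
};

// The caller supplies every element via make(i), so no slot is ever seen empty.
template <typename T, typename F>
std::vector<NonNull<T>> make_non_null_array(std::size_t n, F make) {
    std::vector<NonNull<T>> result;
    result.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        result.push_back(NonNull<T>(make(i)));  // the dynamic conversion check
    return result;
}

// Illustrative use: an array of n squares, summed without any null checks.
inline int sum_of_squares(std::size_t n) {
    auto squares = make_non_null_array<int>(n, [](std::size_t i) {
        return std::make_shared<int>(static_cast<int>(i * i));
    });
    int total = 0;
    for (const auto& s : squares) total += *s;
    return total;
}

// Illustrative helper: confirms that a null pointer is rejected at conversion.
inline bool rejects_null() {
    try {
        NonNull<int> n{std::shared_ptr<int>()};
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

The partially-populated state still exists inside make_non_null_array, but it is confined to a local that nobody else can observe.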
Coping techniques
There are, thankfully, some interesting ways to cope with partially-constructed objects. There is, in fact, a spectrum of techniques, ranging from escape analysis in various forms, to limitations around how objects are constructed such that a partially-constructed one can never leak, to automatic insertion of dynamic checks to prevent the use of partially-constructed objects, to static annotations that treat partially-constructed instances as first class things in the type system. And of course there’s always the technique of “deal with it”, which is the one that most C++-style languages have chosen, including our beloved C#.
The F# approach: restrictions and dynamic checks
F#, it turns out, does a very good job in preventing partially-initialized objects. A first important step is that fields in F# are readonly by-default, unless you opt-in to mutability using the mutable keyword. Therefore data structures are mostly immutable. And the construction rules are meant to make it very unlikely that you’ll expose a partially-constructed object during construction. How so? It’s simple: such fields must be initialized prior to running ad-hoc construction code, and if you attempt to initialize them multiple times, the compiler supplies an error. In other words, you really have to work hard to screw yourself, unlike C++ and C# where it’s very easy.
It’s of course possible to do some dirty tricks to publish or access a partially-initialized object, despite needing to work very hard at it. There is, however, a nice surprise awaiting us when we try. For example:
type C() =
    abstract member virt : unit -> unit
    default this.virt() = ()

type D() as this =
    inherit C()
    do this.virt()

type E =
    inherit D
    val x : int
    new() = { x = 42; }
    override this.virt() = printf "x: %d" this.x

let e = new E()
This example attempts to perform a virtual invocation from D before the more derived class has been fully initialized. This overridden virtual simply (attempts to) print out the value of x. If we compile and run this program, however, we will notice that we get an exception: “InvalidOperationException: the initialization of an object or value resulted in an object or value being accessed recursively before it was fully initialized.”
Pretty neat. The compiler will stick in the checks necessary when ‘this’ is being accessed, to dynamically verify that an object is not being leaked before having been fully initialized. The F# approach can be summed up as trying to make things as airtight as possible statically at compile-time, but admitting that there are holes – primarily due to inheritance – and dealing with them by inserting dynamic runtime checks. This tradeoff clearly makes sense for F#, because it is attempting to attain a robust level of reliability around immutability.
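The flavor of such a dynamic check can be imitated by hand in C++. This is a sketch of the general idea only, not of what the F# compiler actually emits; Checked, Widget and construct are illustrative names of mine:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Tracks whether construction has finished.
class Checked {
    bool initialized_ = false;
protected:
    // Called by every member that must not run on a partially-constructed object.
    void require_initialized() const {
        if (!initialized_)
            throw std::logic_error("object accessed before it was fully initialized");
    }
    // Called once, at the very end of construction.
    void mark_initialized() { initialized_ = true; }
};

class Widget : public Checked {
    int x_ = 0;
public:
    explicit Widget(bool touch_early) {
        if (touch_early) x();  // simulates an access made during construction
        x_ = 42;
        mark_initialized();
    }
    int x() const {
        require_initialized();
        return x_;
    }
};

// Illustrative helper: returns the printed value, or the error message on failure.
inline std::string construct(bool touch_early) {
    try {
        Widget w(touch_early);
        return "x: " + std::to_string(w.x());
    } catch (const std::logic_error& e) {
        return e.what();
    }
}
```

Doing this manually means every accessor pays for the check and every constructor must remember to call mark_initialized, which is why having the compiler insert the checks only where needed is so much more attractive.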
F# also adds non-nullability for the most part. Like Haskell’s Maybe monad, F# adds an option type that can store a single None value which lies outside of the underlying type’s domain to effectively represent null. Because F# is a .NET language it of course also needs to worry about nulls at interop boundaries with other languages like C# and VB. This is a great step forward; first class CLI support would be a nice next step.
A slight variant of the F# idea is to initialize data up the whole class hierarchy in one pass, and then run ad-hoc construction code in a second pass in the usual way. So long as readonly data can be initialized without running the ad-hoc construction code, this helps to statistically cut down on the chances for accidental leaking of ‘this’.
Type system: T versus notconstructed-T
We can model two kinds of T in the type-system: T and notconstructed-T. The constructor for any type T would then see the ‘this’ pointer as a notconstructed-T, and everything else – by default – sees ordinary T’s.
What good does this distinction do? It enables us to add verification and restrictions around the use of notconstructed-T’s and limitations to the conversion between the two types. See this paper by Manuel Fahndrich and Rustan Leino for an example of how this approach was taken in Spec#’s non-nullability work.
For example, we can prohibit conversion between T and notconstructed-T altogether, thereby disallowing escaping ‘this’ references altogether. If the type of ‘this’ within a constructor is different than all other references to type T, and they are not convertible, we’ve successfully walled the problem off in the type system. This protects against erroneous method calls, so that a constructor could not call any of its own methods, because these methods expect an ordinary T whereas the constructor only has a notconstructed-T. And because you cannot state notconstructed-T in the language, you cannot let one leak by storing it into a field.
We could add more sophisticated support, by allowing a programmer to explicitly tag certain non-field references as notconstructed-T. This makes the concept first-class in the language, and allows one to explicitly declare the fact that code is interacting with a partially-constructed T:
class C
{
int m_x;
public C() {
m_x = V();
}
protected virtual int V() notconstructed {
… I know to be careful …
}
}
In this example, the programmer has annotated V with ‘notconstructed’. This enables the call from the constructor because the method’s ‘this’ is a notconstructed-T, and serves as a warning to the programmer that he or she should take care, much like the code written inside a constructor.
We must also decide whether fields are off-limits via notconstructed-T’s. If yes, we can add F#-style dynamic checks for initialization, but only for attempted accesses against notconstructed objects’ fields. This is nice because it means the scope of dynamic checks is limited, and used in a pay-for-play manner. And we could even enable a programmer to sidestep the error by stating at the use-site that they understand a particular field access may be to uninitialized memory, like Field.ReadMaybeUninitialized(&m_field).
To be honest, the reason this approach has likely not yet seen widespread use is that the cost is not commensurate with the benefit. At least, in my opinion. To make something like partially-constructed objects a first-class concept in a programming language, programmers would need to want to know where they are dealing with them. For systems programmers, this makes sense. For many other programmers, it would be useless ceremony with no perceived value. And yet the initial approach where nothing new needed to be stated, but yet escaping ‘this’ was prevented, blocks certain patterns of legal use. This is where theory and practice run up against one another. There is, however, presumably a nice middle ground awaiting discovery.
Winding down
I meant for this to be a short post. But the topic really is fascinating, and has been coming up time and time again as we do language work here at Microsoft. But it is truly fascinating mainly because, like nulls, the problem is widespread yet tolerable, and most C++ and C# programmers learn the rules and make do. Partially-constructed objects are a major blemish, but not a crisis that threatens the complete abandonment of imperative programming.
I do believe language trends indicate that more will move away from C++-style object initialization and related issues, and more towards immutability and treating data and its initialization separately from imperative ad-hoc initialization code. Haskell, F#, and Clojure, for example, show us some promising paths forward. There is a plethora of other attempts at solving related problems, and I unfortunately could only scratch the surface.
Although these techniques are not new, the primary question – to me – is how close to “the metal” in systems programming these concepts can be made to work. Typically for those situations, you need to rely on a mixture of static verification and complete freedom, because the dynamic checking is too costly and the code to work around overly-limiting static verification also adds too much cost. But as soon as you add complete freedom into the picture, you run into partially-constructed objects as a consequence, and all the issues I’ve mentioned above.
Anyway, I hope it was interesting.
 Sunday, May 16, 2010
My article about Transactional Memory (TM) was picked up by a few news feeds recently.
If I had known this would occur, I would have written it with greater precision. Because my article presents a mixture of technical challenges interspersed among more subjective and cultural issues, I am sure it is difficult to tease out my intended conclusion. To summarize, I merely believe adding TM to a shared memory architecture alone is insufficient.
Indeed, I remain a big fan of transactions. Atomicity, consistency, and isolation, and coming up with strategies for achieving all three in tandem, are part and parcel to architecting software.
After watching Barbara Liskov's OOPSLA Turing Award reprise, I decided to reacquaint myself with some old Argus papers this weekend. It has been some time since I last read them. Argus was Liskov's language for distributed programming and her follow-on to CLU. As with most research done by brilliant people, the work was way ahead of its time, has appeared in ad-hoc incarnations and permutations over time, and remains relevant today. This research is particularly interesting to work that my team is doing right now, especially its notion of guardians. And it is relevant to the TM discussion too.
The Argus approach of using isolation to coarsely partition state and operations into independent bubbles, and then communicating asynchronously between the so-called guardians that are responsible for this state, is an architecture that is common among most highly concurrent programs. This aids state management and fault tolerance. Argus makes an interesting observation that, although guardians may be sent messages concurrently -- and indeed activities themselves may even introduce local concurrency -- manipulation of state can be done safely and even in parallel thanks to transactions.
The requirement is that messages are atomic and commute. Transactions, it turns out, are a convenient way of implementing this requirement.
You will observe a similar architecture in other places, including in some languages that have adopted TM. Haskell has moved in this direction. Everything is purely functional and so, of course, no state is mutated in an unsafe way by default. However, with the introduction of concurrency comes mutable cells for message passing and with parallelism comes indeterminism. You can push the state management problem up indefinitely, but at the top there are almost always mutable operations on real-world state (even if it is "just I/O"). Haskell programs have a safe architecture to begin with, and it is the intentional and careful addition of specific facilities that forces one to focus on the problematic seams. One could say that Haskell starts clean and stays clean, versus most shared memory-based languages which start dirty and try to attain cleanliness (at least when it comes to concurrency).
Why aren't transactions sufficient, then, given that the I in ACID stands for Isolation? You wouldn't model a database as one flat table in which each row is a single byte, however, would you? As you begin to decompose your program into isolated state, your bubbles (or guardians) are the tables, and your objects are the rows. This is just an analogy but I find it useful to think in these terms. Taking a bunch of intermingled state and pouring transactions on top is not going to give you this nicely partitioned separation of state which has proven to be the lifeblood of concurrency.
Even once you've attained a more isolated architecture, however, transactions are not a panacea. They are just one of many viable state management techniques in a programmer's arsenal, hierarchical state machines being another notable example. And in fact, many of the problems I mentioned in the TM article are still worrisome even when you start from the right place. From within a guardian, you may wish to enlist the aid of another unrelated guardian to perform a coordinated atomic activity, because a higher level invariant relationship between them must be preserved. Or an application which composes multiple guardians may wish to do so atomically. Even Argus required manual compensation for such things. This can be solved in part by DTC. But experience suggests that continuing to push the enlistment scope one level higher eventually leads to substantial problems. A topic for another day, I suppose.
My primary conclusion is that TM is a great complement to highly concurrent programs, but only so long as you start from the right place. The Argus and Haskell approaches are conducive to large-scale concurrency, but it is primarily because of the natural isolation those models provide; the addition of transactions addresses problems that remain after taking that step. But without that first step, they would have gotten nowhere.
 Sunday, April 25, 2010
We use static analysis very heavily in my project here at Microsoft, as a way of finding bugs and/or enforcing policies that would have otherwise gone unenforced. Many of the analyses we rely on are in fact minor extensions to the CLR type system, and verge on “effect typing”, an intriguing branch of type systems research that has matured significantly over the years.
Many of these annotations are done on methods, rather than types. A few examples include:
- [MayBlock] indicates that a method is free to call methods that might block.
- [NoAllocations] indicates that a method is neither allowed to allocate, nor call another method that might allocate.
- [Throws(...)] indicates that a method is allowed to throw an exception of a type in the set { … }, or call other methods that may throw exceptions in the set { … }.
- And so on.
It turns out there’s a general way for handling these annotations. And indeed, you will quickly find that pursuing ad-hoc solutions to each independently leads to troubles. We shall briefly look at the generalization.
We must first observe that each falls into one of two categories: additive or subtractive. MayBlock and Throws are additive. They say what is permitted. NoAllocations, on the other hand, is subtractive, because the annotation declares what is not permitted. This distinction, we shall see, is crucial.
First we can imagine that each distinct effect shown above has a distinct effect type.
The types EMayBlock, ENoAllocations, and EThrows correspond to the annotations above. This will permit us to model effects using subtyping polymorphism. We will use the usual notation, i.e. “S <: T” means “S is a subtype of T”, or “an S is substitutable in place of a T”. For example, String <: Object. Throws is special, because it has a type hierarchy of its own beneath the root type. As you might guess, this hierarchy is infinite in size and comprises every possible combination of exception types.
There are two special kinds of effects: the null effect (ENil), and a set of other effects (EMany). The latter permits us to create a new, unique effect type merely by concatenating a list of other effect types.
Each method is then given an EMany effect type containing its full set of effect types. For example:
[MayBlock, Throws(typeof(FileNotFoundException)), NoAllocations] void M() { … }
Is given the distinct effect type EMany { EMayBlock, EThrows(typeof(FileNotFoundException)), ENoAllocations }.
We should make one generalization before moving on. ENil ~ EMany { }. In other words, having no effects is equivalent to a list of no effects. Furthermore, EMany { } ~ EMany { ENil }. In other words, a list of no effects is equivalent to a list containing only the null effect.
Now we are ready to weave everything together. The main question confronting us is as follows: What is the subtyping relationship between the various effect types, including the null and list types?
The easiest to do away with is the EMany type. Given two EMany types E and F, then E <: F if, for all effects T in E’s type set, there exists an effect type U in F’s type set such that T <: U. In simpler terms, a list is a subtype of another list so long as all of its components are also subtypes of a component of the other. This is very abstract, but we shall see soon why it is useful.
Now we get to see why the additive and subtractive distinctions are so important:
Given an additive effect type EAdditive, we say ENil <: EAdditive.
Given a subtractive effect type ESubtractive, we say ESubtractive <: ENil.
The first statement says that a method with no effects is substitutable for a method with additive effects, and the second says that a method with subtractive effects is substitutable for a method with no effects. The corollaries are perhaps just as important. A method with additive effects cannot take the place of a method with no effects, whereas a method with subtractive effects can.
For the simple single-effect case, effects depicted in this way represent points on a line, where ENil is zero, subtractive effects are negative integers, and additive effects are positive integers. The lattice obviously becomes rather complicated as many effects accumulate.
Where does substitutability come up with respect to methods, anyway, you may ask? The first is in determining which other methods can be called. If a method M with effects E is trying to call another method N with effects F, this is permitted so long as F <: E. The next is in virtuals and overriding. A virtual with effects E may be overridden by a method with effects F so long as F <: E. The following example illustrates this idea, in addition to the composition of the subtyping rules we have shown so far:
class C { public virtual void M() {} [MayBlock, NoAllocations] public virtual void N() {} }
class D : C { [MayBlock, NoAllocations] public override void M() {} public override void N() {} }
In this example, the four methods are given the following effect types:
- C::M gets EMany { ENil }, or just ENil.
- C::N gets EMany { EMayBlock, ENoAllocations }.
- D::M gets EMany { EMayBlock, ENoAllocations }.
- D::N gets EMany { ENil }, or just ENil.
What does all this gibberish mean? Well it’s straightforward and intuitive, actually.
We are attempting to add the MayBlock and NoAllocations effects to the overridden M method which has none. Because MayBlock is additive, this is illegal (someone might call C::M thinking the code will not block), whereas it is OK for NoAllocations (calls through D::M are assured no allocations will happen, even though calls through C::M are guaranteed no such thing). Similarly, we are attempting to remove both effects from the overridden N method. Because MayBlock is additive, this is OK (D::N isn’t required to block, even though calls through C::N may suspect it of doing so), whereas it is decidedly not OK for NoAllocations (calls through C::N will reasonably assume allocations do not happen, whereas D::N would be free to perform them). It may take some thought to convince yourself that this is correct, but I hope that you find that it is. All of this works because of the subtyping of effect types.
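To make the override rules concrete, here is a minimal C# sketch of the check for this example. It is only an illustration, not the analysis we actually run: the Effect, Polarity, and EffectRules names are made up, and rather than transcribing the EMany rule literally it checks each kind of effect directly. The assumption is that an override may not add an additive effect the base lacks, and may not drop a subtractive effect the base declares.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical model for illustration; none of these types exist in .NET.
public enum Polarity { Additive, Subtractive }

public sealed class Effect
{
    public readonly string Name;
    public readonly Polarity Polarity;

    public Effect(string name, Polarity polarity)
    {
        Name = name;
        Polarity = polarity;
    }
}

public static class EffectRules
{
    public static readonly Effect MayBlock = new Effect("MayBlock", Polarity.Additive);
    public static readonly Effect NoAllocations = new Effect("NoAllocations", Polarity.Subtractive);

    // Returns true when a method with effects f may stand in for
    // a method with effects e, i.e. F <: E in the notation above.
    public static bool IsSubstitutable(ISet<Effect> f, ISet<Effect> e)
    {
        // Additive effects are permissions: f may not claim one that e lacks.
        bool additiveOk = f.Where(x => x.Polarity == Polarity.Additive)
                           .All(x => e.Contains(x));

        // Subtractive effects are promises: f must keep every one that e makes.
        bool subtractiveOk = e.Where(x => x.Polarity == Polarity.Subtractive)
                              .All(x => f.Contains(x));

        return additiveOk && subtractiveOk;
    }
}
```

With an empty set standing in for ENil and a set holding both effects, IsSubstitutable(both, none) rejects D::M on account of MayBlock, and IsSubstitutable(none, both) rejects D::N on account of NoAllocations.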
All of this works similarly with delegates. The source delegate signature is akin to the base class in the above example, whereas the target method being bound to is like the override.
Things get a little more complicated when considering the EThrows effect. It is additive, so it is of course true that ENil <: EThrows(*). However, what if we have two different EThrows, and wish to inquire about substitutability of one in place of the other? We can come up with a simple rule that is general purpose for all set-of-type kinds of effects. Namely, consider two instances A and B of the same effect type:
Given an additive effect type EAdditive, then A <: B if, for all types T in A’s set-of-types, there exists a type U in B’s set-of-types such that T <: U.
Given a subtractive effect type ESubtractive, then A <: B if, for all types T in A’s set-of-types, there exists a type U in B’s set-of-types such that U <: T.
These sound quite similar, except that they end differently (i.e. T <: U vs. U <: T). We may illustrate the additive case with EThrows; to illustrate the subtractive case, let us imagine we can declare an ENoAllocations effect type that specifies which precise types may not be allocated:
class A{} class B : A{}
class C { [Throws(typeof(Exception)), NoAllocations(typeof(A))] public virtual void M() {} [Throws(typeof(FileNotFoundException)), NoAllocations(typeof(B))] public virtual void N() {} }
class D : C { [Throws(typeof(FileNotFoundException)), NoAllocations(typeof(B))] public override void M() {} [Throws(typeof(Exception)), NoAllocations(typeof(A))] public override void N() {} }
The results should not be surprising. D::M overrides C::M’s exception list, by being more specific and declaring that FileNotFoundException is thrown instead of just Exception. This is OK. Whereas D::N overrides C::N’s list by being more general purpose, specifying Exception instead of FileNotFoundException. This is clearly not OK. The NoAllocations type works in exactly the reverse. D::M attempts to prohibit allocations of B, but this is merely one possible subtype of the base method C::M’s declaration of A, and therefore this is illegal. Whereas D::N ensures no instances of A are allocated, which of course subsumes the base method C::N’s declaration that no B’s are allocated.
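The additive rule for EThrows is easy to sketch in C# using reflection. This is a rough illustration only: ThrowsRule and IsSubtype are hypothetical names, and each Throws annotation is modeled as a plain array of exception types.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of any real analysis tool.
public static class ThrowsRule
{
    // A <: B when every exception type in a is a subtype of
    // (assignable to) at least one exception type in b.
    public static bool IsSubtype(Type[] a, Type[] b)
    {
        return a.All(t => b.Any(u => u.IsAssignableFrom(t)));
    }
}
```

Checking { FileNotFoundException } against { Exception } succeeds, which is the D::M case, whereas checking { Exception } against { FileNotFoundException } fails, which is the D::N case. An empty array on the left always succeeds, matching ENil <: EThrows(*).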
Everything gets a little more interesting when you consider generics. For example, how would we type a general purpose Map method? (This pattern arises quite frequently.) We would presumably want it to somehow “acquire” the effects of the delegate it invokes on all elements in a list. For example:
U[] Map<T, U>(T[] input, Func<T, U> func);
This declaration is stronger than necessary. The Func<T, U> delegate type – prepackaged with the .NET Framework – does not have any effects on it. So it may not, for example, bind to a method that has any additive effects like Throws on it. This is rather unfortunate.
To solve this we could imagine treating effects with parametric polymorphism:
[Effects(E)] U Func<T, U, [EffectParameter] E>(T x);
This fictitious syntax merely says that Func can be instantiated with an effect type E, and that the Func “method” itself acquires the effect E. (Admittedly I should stop using faux-attribute syntax for illustrations since we’ve reached this level of language integration.) Now Map can be declared as such:
[Effects(E)] U[] Map<T, U, [EffectParameter] E>(T[] input, Func<T, U, E> func);
This says that Map has the same effects as the Func that is supplied as an argument. It turns out that we may want to extend this further, by enabling symbolic manipulation of effects. We may wish, for example, to specify that the Func is not allowed to block, by stating it does not have [MayBlock] in it. You could imagine using something very similar to generic constraints to achieve this. It is also interesting to allow concatenation of multiple effect types, both through partial and full specialization. For example, Map above may clearly have effects of its own. You also tend to want generic constraints like, 'where E : F', which of course just depends on the aforementioned subtyping rules. And of course C# 4.0's co- and contravariance can be applied to effects too.
Anyway, I have probably gone beyond most readers’ interest level in this subject. Things sure do get very interesting when you allow symbolic manipulation of effects. They get even more interesting when you begin to think of types as having “permissible effects” attached to them. However, the main thing I wanted to point out with this brief article is that this pattern arises quite frequently. And despite everyone struggling through what seem to be odd corner cases as they develop ad-hoc solutions, there really is a sound generalization behind it all. Many languages have first class effect typing, and I have found it liberating to think of many of these type system annotations through that lens. Perhaps you shall too.
 Sunday, February 28, 2010
Simon Peyton Jones was in town a couple weeks back to deliver a repeat of his ECOOP’09 keynote, “Classes, Jim, but not as we know them. Type classes in Haskell: what, why, and whither”, to a group of internal Microsoft language folks. It was a fantastic talk, and pulled together multiple strains of thought that I’ve been pondering lately, most notably the common thread amongst them.
In the talk, he compared polymorphism in Java-like languages (including C#, which I will refer to instead of Java henceforth) with ML and Haskell. In other words, how does a programmer commonly write code in each language that is maximally reusable? Of course, C# programmers primarily achieve this through subclassing, whereas functional programmers rely on type parameterization. Over the years, however, the former group has begun to borrow a great deal from the latter; as evidence, witness the increasingly pervasive use of generics in both Java and C# over the past decade. The talk was given mainly through the lens of this evolution, which appears to approach an interesting limit if projected far enough into the future.
Type classes came on the scene towards the end of the 1980’s, and immediately became a fertile seed for research and exploration in the relationship between subclass and parametric polymorphism. Type classes are much closer to subclass polymorphism than Haskell’s borrowed ML-style, which is to say parametric polymorphism. This is most intriguing because Haskell does not rely on subclassing, and so the mixture of the two breeds new patterns.
I thought that it might be interesting to compare the mixture of subclass and parametric polymorphism in Haskell vis-à-vis type classes with the same in C# vis-à-vis a mixture of interfaces, generics, and generic constraints. Hence this post. We shall proceed by examining some basic type classes in Haskell with their equals in C#. Though similar, the dissimilarities are as stark as the similarities. And the lack of higher kinds -- particularly when combined with type classes -- means that some Haskell patterns simply are not expressible in C#.
The Simple Case: Equality (or Lack Thereof)
The most basic type class of all is Eq, which allows the comparison of two like-typed pieces of data. This may seem like a commodity if you ordinarily write code in languages like Java and C# which have a strong notion of object identity. In Haskell, however, equality is value equality over algebraic data types rather than objects, so polymorphism over equality operators is quite a bit more important. Indeed, as we shall see, Haskell’s approach is more powerful than == in Java-like languages. (Witness the neverending dichotomy between reference and value equality vis-à-vis Object.Equals in C#.) But alas, let us proceed by crawling in a series of logical steps, rather than leaping to the conclusions.
Haskell’s Eq type class is defined as such:
class Eq a where
    (==), (/=) :: a -> a -> Bool
    x /= y = not (x == y)
    x == y = not (x /= y)
As you see, Eq provides two operators: == and /=. Default implementations of each define == as the inverse of /= and /= as the inverse of ==. Not only is this a convenience, but it also specifies the desired contract implementations ought to abide by. Other types may become members of the Eq class by mapping one or both operators to type-specific functionality. You will immediately recognize the similarity to virtual methods in OOP languages, where the operators can be overridden by subclasses.
Of course all of the primitive data types already implement Eq, so you get value equality over numbers, strings, etc. Imagine we declared a new Coords type – comprised of two integers – and want to make it a member of Eq also – wherein equality is determined by a pairwise comparison of each’s members:
data Coords = Coords { fst :: Integer, snd :: Integer }
We make Coords a member of the Eq type class, and thereby define equality over its instances, through the ‘instance Eq Coords where’ construct. This maps type class functions to real implementation functions. The example here defines them inline, though you may of course refer to existing functions instead:
instance Eq Coords where
    (Coords fst1 snd1) == (Coords fst2 snd2) = (fst1 == fst2) && (snd1 == snd2)
Now we can take a ‘[Coords]’ and ask whether a particular ‘Coords’ exists within it.
A function may constrain a type variable to a certain class, and thereby access members of that class. For example, the following ‘isin’ function tests whether an instance of some type ‘a’ is contained within a list of type ‘[a]’. To do this, it demands that ‘a’ is a member of Eq using the syntax “Eq a =>”:
isin :: Eq a => a -> [a] -> Bool
x `isin` [] = False
x `isin` (y:ys) = x == y || (x `isin` ys)
The moral equivalent to the Eq type class in C# is not so easy to decide. The most obvious first guess is the built-in == and != operators. However, we will quickly find that this is not quite right, because these operators are not polymorphic in C#. To illustrate this point, let’s try to write the ‘isin’ method in C#, using generics and the == operator, for example:
bool IsIn<T>(T x, T[] ys)
{
    foreach (T y in ys) {
        if (x == y)
            return true;
    }
    return false;
}
This function will not compile. The reason is that == and != in C# are not defined over all types (specifically not for value types). You can get IsIn to compile by restricting the T to a reference type:
bool IsIn<T>(T x, T[] ys) where T : class
{
… same as above …
}
Although this code is deceptively similar to the Haskell example, it is actually quite different. The == used to compare two instances compiles into the MSIL CEQ instruction, effectively hard-coding an object identity comparison. Even if an overloaded == operator for a particular instantiated T is available, the compiler will not bind to it. Why? Because it is overloading and specifically *not* overriding. For example, say we had a MyClass type and an overloaded == operator for comparing two instances:
class MyClass
{
public static bool operator ==(MyClass a, MyClass b) { return true; }
public static bool operator !=(MyClass a, MyClass b) { return false; }
}
According to this, all MyClass objects are equal. However, the following call yields the answer ‘false’:
IsIn<MyClass>(new MyClass(), new MyClass[] { new MyClass() });
The same problem arises should instances of MyClass get referred to by Object references. == and != do not perform any kind of virtual dispatch; the selection of implementation is chosen statically.
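The static binding is easy to observe for yourself. The sketch below is a small experiment under my own naming (AlwaysEqual and BindingDemo are hypothetical); it compares the same two objects through three differently typed call sites.

```csharp
// Hypothetical types for illustration only.
public sealed class AlwaysEqual
{
    // Overloaded operators: chosen at compile time from the static operand types.
    public static bool operator ==(AlwaysEqual a, AlwaysEqual b) { return true; }
    public static bool operator !=(AlwaysEqual a, AlwaysEqual b) { return false; }

    // Overridden so the compiler does not warn about == without Equals/GetHashCode.
    public override bool Equals(object obj) { return obj is AlwaysEqual; }
    public override int GetHashCode() { return 0; }
}

public static class BindingDemo
{
    // Operands are statically AlwaysEqual, so the overload is selected.
    public static bool ViaOverload(AlwaysEqual a, AlwaysEqual b) { return a == b; }

    // Operands are statically object, so this is a reference comparison.
    public static bool ViaObject(object a, object b) { return a == b; }

    // T is only known to be a reference type, so this is also a reference comparison.
    public static bool ViaGeneric<T>(T a, T b) where T : class { return a == b; }
}
```

Given two distinct AlwaysEqual instances, ViaOverload answers true while ViaObject and ViaGeneric both answer false, even though the runtime type of the arguments is identical in all three calls.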
Perhaps it is the Equals method inherited from System.Object, then? This, at least, is virtual. And indeed, this gets much closer to Eq. Any type may override Equals, and a generic definition defined in terms of it dispatches virtually and allows subclasses to change behavior on a type-by-type basis:
bool IsIn<T>(T x, T[] ys)
{
    foreach (T y in ys) {
        if ((x == null && y == null) || (x != null && x.Equals(y)))
            return true;
    }
    return false;
}
(Even this is slightly different, because it assumes a certain type-agnostic behavior about nulls.)
This is cheating, however. We’ve taken advantage of the fact that someone thought to put an Equals method on System.Object, thereby giving all Ts such a method. There are clearly limits to how many crosscutting things can be added to System.Object before it becomes overwhelmed with concepts, not to mention the size (e.g. v-tables). Moreover, Equals on Object is weakly typed; a better solution is to use interfaces, like the IEquatable<T> interface that introduces a strongly typed Equals method:
public interface IEquatable<T>
{
    bool Equals(T other);
}
And to use a generic type constraint on IsIn’s T, much more akin to what ‘isin’ in Haskell above did:
bool IsIn<T>(T x, T[] ys)
    where T : IEquatable<T>
{
    foreach (T y in ys) {
        if ((x == null && y == null) || (x != null && x.Equals(y)))
            return true;
    }
    return false;
}
This is cheating a little less, because we can implement an interface after-the-fact without impacting a class’s type hierarchy. This, in fact, looks remarkably similar to the Haskell ‘isin’ shown earlier, using type classes and parametric polymorphism, where here we have used interfaces in place of type classes.
We might be tempted to define a default NotEquals method over all IEquatable<T> instances, just like Haskell does by implementing the defaults for == and /= as the inverse of each other:
public static class Equatable
{
    public static bool NotEquals<T>(this IEquatable<T> @this, T other)
    {
        return !@this.Equals(other);
    }
}
This is not perfect. It is not polymorphic; see my previous post for an extensive discussion of this and related points. And what about nulls? If '@this' is null, the default implementation is going to throw a NullReferenceException. We’d need to bake in type-agnostic knowledge of null again. Sigh!
Sadly, it turns out this whole approach in general isn’t quite right anyway. For two reasons:
- First, we still infect the type in question with the interface being implemented; it cannot be done completely outside of the type’s definition, as with type classes.
- Second, type classes in Haskell do not actually require a value of the type in question to dispatch against the class’s functions, whereas we clearly do in the above example: we need to virtually dispatch against the object, and rely on this virtual dispatch to execute different code for each type. This will come up as we look at the numeric classes, but it is a critical difference.
A closer analogy is to use IEqualityComparer<T>:
public interface IEqualityComparer<T>
{
bool Equals(T x, T y);
}
(IEqualityComparer<T> in .NET also has a GetHashCode method on it. Let’s ignore that for now.)
Unfortunately, if our IsIn method were to use IEqualityComparer<T> to do its job, callers would be required to pass an instance explicitly; we cannot infer a “default” comparer based solely on the T:
bool IsIn<T>(T x, T[] ys, IEqualityComparer<T> eq)
{
foreach (T y in ys) {
if (eq.Equals(x, y))
return true;
}
return false;
}
Type classes actually function rather similarly, with two major differences:
- The interface object – called a dictionary – is passed and used implicitly.
- The mapping from types to dictionaries is done implicitly, whereas in .NET you’ll need to find an instance of the interface in question through other means.
This second difference is solved by a little hack in .NET. If you take a look at the EqualityComparer<T>.Default property, you shall see a lot of slightly gross reflection code to return an instance of IEqualityComparer<T> for any arbitrary T. The code checks some well-known types and conditions, and ultimately falls back to the aforementioned interfaces and default Equals method for the most general case. It’s not pretty, but it’s a beautiful hack given the tools at our disposal in C#.
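In practice this means IsIn can offer both shapes: an overload that looks up the "dictionary" via EqualityComparer<T>.Default, and one that accepts it explicitly. A minimal sketch follows; the Membership class name is my own invention.

```csharp
using System.Collections.Generic;

// Hypothetical container class for illustration.
public static class Membership
{
    // Callers that do not care get the default comparer for T.
    public static bool IsIn<T>(T x, T[] ys)
    {
        return IsIn(x, ys, EqualityComparer<T>.Default);
    }

    // Callers that do care pass the comparer (the "dictionary") explicitly.
    public static bool IsIn<T>(T x, T[] ys, IEqualityComparer<T> eq)
    {
        foreach (T y in ys) {
            if (eq.Equals(x, y))
                return true;
        }
        return false;
    }
}
```

For example, Membership.IsIn("a", new[] { "A" }) is false under the default comparer, but becomes true when StringComparer.OrdinalIgnoreCase is passed as the third argument.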
A Harder Case: Polymorphic Numbers, on Output Parameters
The Eq type class is easy. The functions it defines are polymorphic on their inputs, but not on their outputs; both == and /= return Bool values. Once we transition to polymorphic output parameters or return values, we encounter a pattern quite different from that which is found in most .NET interfaces.
Let’s illustrate these differences by looking at Haskell’s Num type class:
class (Eq a, Show a) => Num a where
    (+), (-), (*) :: a -> a -> a
    negate :: a -> a
    abs, signum :: a -> a
    fromInteger :: Integer -> a
Here we see another feature of Haskell type classes: inheritance. Num derives from both Eq and Show – indicated by “(Eq a, Show a) => Num a” – the latter class of which we have not yet shown but is the moral equivalent to .NET’s Object.ToString method. It enables pretty printing of values, clearly something that would be expected to be common among all numeric data types. Haskell’s numeric class hierarchy is quite elegant, enabling highly polymorphic computations. A nice little tutorial can be found here: http://www.haskell.org/tutorial/numbers.html.
But the question at hand is what the C# equivalent would be.
Our first approach would be to mimic the IEquatable<T> solution above:
interface INumeric<T>
{
T Add(T d);
T Subtract(T d);
T Multiply(T d);
T Absolute();
T FromInteger(int x);
}
This works fine, and primitive types in .NET could presumably implement it:
struct int : INumeric<int> { .. }
struct float : INumeric<float> { .. }
struct double : INumeric<double> { .. }
…
This enables polymorphic code, like a Sum method, through the use of generic type constraints:
public static T Sum<T>(params T[] values)
where T : INumeric<T>
{
T accum = default(T);
foreach (T v in values)
accum = v.Add(accum);
return accum;
}
This example works great. Why then, you might wonder, doesn’t LINQ use this instead of providing special-case overloads of Average, Min, Max, Sum, etc. for all well-known primitive data types?
The primary reason is the performance hit taken to perform addition through O(N) interface calls versus O(N) MSIL ADD instructions. It is just a basic fact of life that today’s leading edge separate compilation techniques will not achieve parity with the hand specialized variants. While it is true that the JIT compiler *could* specialize the code for specific Ts and specific interfaces to emit more efficient instructions, like int, float, etc. over INumeric<T> calls, it will not do so today. This reduces the ability to share code – which admittedly is what we want here – and is tangled up in a judgment call based on heuristics. But I digress.
There is a larger problem that arises with other examples, at least from a language expressiveness point-of-view: the need to have an instance in hand to invoke interface methods. FromInteger, for example, is rather awkward to write. In fact, we cannot write a method with INumeric<T> like we could in Haskell:
public static T MakeT<T>(int value)
where T : INumeric<T>
{
… ? …
}
How do we invoke FromInteger, given that no T is available at the time of MakeT’s invocation? You can’t; you need to arrange for an instance to be available. There are ways out of this corner. One solution is to mandate that T has a default constructor:
public static T MakeT<T>(int value)
    where T : INumeric<T>, new()
{
    return new T().FromInteger(value);
}
That is always acceptable for structs, since they always have such a constructor; but this practice requires that classes be designed to possibly not hold invariants at all times, and so is not always acceptable or at the very least requires design accommodation.
The alternative is probably obvious. Use a similar approach to IEqualityComparer<T>:
interface INumericProvider<T>
{
T Add(T x, T y);
T Subtract(T x, T y);
T Multiply(T x, T y);
T Absolute(T x);
T FromInteger(int x);
}
And now, of course, each method that does polymorphic number crunching must accept an instance of INumericProvider<T>. That’s particularly cumbersome, so it’s more likely that .NET developers would prefer the aforementioned approach, where the type must provide a default constructor.
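For completeness, here is roughly what the provider approach looks like when written out. Treat it as a sketch: the interface is trimmed to two members to keep things short, and Int32Provider, DoubleProvider, and Numeric are hypothetical names rather than anything that ships in .NET.

```csharp
// Trimmed version of the provider interface above (two members only).
public interface INumericProvider<T>
{
    T Add(T x, T y);
    T FromInteger(int x);
}

// Hypothetical providers for two primitive types.
public sealed class Int32Provider : INumericProvider<int>
{
    public int Add(int x, int y) { return x + y; }
    public int FromInteger(int x) { return x; }
}

public sealed class DoubleProvider : INumericProvider<double>
{
    public double Add(double x, double y) { return x + y; }
    public double FromInteger(int x) { return x; }
}

public static class Numeric
{
    // No instance of T is needed up front: the provider manufactures the zero.
    public static T Sum<T>(INumericProvider<T> num, params T[] values)
    {
        T accum = num.FromInteger(0);
        foreach (T v in values)
            accum = num.Add(accum, v);
        return accum;
    }

    // MakeT becomes trivial once the provider is passed in.
    public static T MakeT<T>(INumericProvider<T> num, int value)
    {
        return num.FromInteger(value);
    }
}
```

A call then reads Numeric.Sum(new Int32Provider(), 1, 2, 3), which shows both the benefit (no default constructor, no instance in hand) and the cost (every call site must thread the provider through).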
Admittedly, I seldom run into this particular problem in practice; but when I do, I really wish I had something like Haskell type classes to help me out.
Before moving on, it is worth pointing out one Haskell type class problem that explicit interface object passing in .NET helps to avoid. Should you need multiple implementations of a given class for the same type, as is relatively common with equality comparisons, you must disambiguate in Haskell by separation by module and being careful about what you import. This is similar to C#’s extension methods. With explicitly passed interface objects, however, it is trivial to manage and pass separate objects if you’d like.
Close, but No Cigar: Higher Kinds
There is one last feature that Haskell provides – a pretty big one, I might add – that C# simply cannot do: higher kinded types, or polymorphism over constructed types. This feature is orthogonal to type classes, but gets used pervasively in conjunction with them. An example will make this stunningly clear:
class Monad m where
    (>>=) :: m a -> (a -> m b) -> m b
    (>>) :: m a -> m b -> m b
    return :: a -> m a
    fail :: String -> m a
    m >> k = m >>= \_ -> k
    fail s = error s
Let’s try to transcribe the core of this class in C#, renaming >>= to Bind, and omitting >> because it has a default implementation:
public interface IMonad<M, A>
{
    M<B> Bind<B>(M<A> m, Func<A, M<B>> k);
    M<A> Return(A a);
    M<A> Fail(string s);
}
This approach is tantalizingly close. It suffers from the already-admitted problem that, for any M<A> instance, you will need to pass the appropriate IMonad<M, A> provider object – just as with the IEqualityComparer<T> and INumericProvider<T> examples above.
But the code of course won’t *actually* compile, because the type variable M cannot be constructed as shown here. We find references to M<A> and M<B>, which are complete nonsense to C#. M is just a plain type variable. M is required to be what Haskell calls a type constructor (* -> *), which is a generic type that must be instantiated before it is a terminal type. I’ve written about this before. Although it seems like a trivial omission in C#’s language definition, it strikes at the heart of the type system.
A fictitious syntax for expressing this in C# might be:
public interface IMonad<M, A>
    where M : <>
{
    ...
}
And if, say, M were expected to be a two- or three-parametered type, we would find, respectively:
... where M : <,>
... where M : <,,>
And so on.
This could in theory work. But C# -- and, more worryingly, .NET and the CLR -- do not support this presently, and, to be quite honest, likely never will. It is immensely powerful, however. Life without monads is a life destined to continuous repetition. The “LINQ Pattern” is one case in .NET where, for each ‘source’ type, we must create a “copy” of the original System.Linq.Enumerable variant. And shame on those who wish to write polymorphic code that will work for any LINQ provider.
Winding Down
Let’s wind down. I need to go grab dinner at Mama's Fish House on Maui right now.
I hope to have shown some of the similarities and dissimilarities between type classes and interfaces, and some patterns that arise when these things are mixed with parametric polymorphism. The mix of inheritance for type classes, but not for implementation types, in Haskell is unique. C#, of course, allows inheritance both amongst interfaces and implementations which is both a blessing and a curse.
I do think both camps have something to teach one another. For example, having a default interface lookup mechanism for arbitrary types in C# would be wonderful, and indeed might provide a replacement for extension methods that has more longevity. I’m sure much of this will happen with time; either “in place” as the respective languages evolve, or as new languages are created with time.
But most importantly, I hope that the blog post was educational and fun. Enjoy.
 Tuesday, February 09, 2010
One of my comments in the 2nd edition of the .NET Framework design guidelines (on page 164) was that you can use extension methods as a way of getting default implementations for interface methods. We've actually begun using these techniques here on my team. To illustrate this trick, let's rewind the clock and imagine we were designing new collections APIs from day one.
Let's say we gave the core interfaces the most general methods possible. These may neither be the most user friendly overloads nor the ones that most people use all the time. They would, however, be those from which all the other convenience methods could be implemented. An INewList<T> interface that was designed with these principles in mind may look like this:
public interface INewList<T> : IEnumerable<T>
{
int Count { get; }
T this[int index] { get; set; }
void InsertAt(int index, T item);
void RemoveAt(int index);
}
This interface is missing all the nice convenience methods you will find on .NET's IList<T>, like Add, Clear, Contains, CopyTo, IndexOf, and Remove. So it's not really as nice to use. You can't write an API that takes in an INewList<T> and performs an Add against it, for example, like you can with IList<T>.
One approach to solving this might be to write a concrete class -- much like .NET's System.Collections.ObjectModel.Collection<T> -- that provides concrete implementations of all of these methods, and then other lists can simply subclass that. But we can do better.
Instead, let's give INewList<T> default implementations of all of these methods. How do we do this? That's right: with extension methods. Voila!
public static class NewListExtensions
{
public static void Add<T>(this INewList<T> lst, T item)
{
lst.InsertAt(lst.Count, item);
}
public static void Clear<T>(this INewList<T> lst)
{
int count;
while ((count = lst.Count) > 0) {
lst.RemoveAt(count - 1);
}
}
public static bool Contains<T>(this INewList<T> lst, T item)
{
return lst.IndexOf(item) != -1;
}
public static void CopyTo<T>(this INewList<T> lst, T[] array, int arrayIndex)
{
for (int i = 0; i < lst.Count; i++) {
array[arrayIndex + i] = lst[i];
}
}
public static int IndexOf<T>(this INewList<T> lst, T item)
{
var eq = EqualityComparer<T>.Default;
for (int i = 0; i < lst.Count; i++) {
if (eq.Equals(item, lst[i])) {
return i;
}
}
return -1;
}
public static bool Remove<T>(this INewList<T> lst, T item)
{
int index = lst.IndexOf(item);
if (index == -1) {
return false;
}
lst.RemoveAt(index);
return true;
}
}
Well isn't that neat. We've now given any INewList<T> implementations all these common methods without dirtying their class hierarchies, built atop a tiny core of extensibility. This is much like .NET's Collection<T>, which exposes its core as protected virtual methods. Indeed, we can go even further. Any convenience overloads, like the multitude of CopyTos on List<T> in .NET, can be given to all INewList<T>'s also. And yet implementing INewList<T> remains as braindead simple as it was before: two properties and two methods. In fact, it's simpler than doing a more feature-rich IList<T>, because the convenience methods come for free.
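To see how little an implementer has to write, here is a sketch of a minimal list. SimpleList<T> is a hypothetical class that just wraps List<T>; the interface is repeated and the extension class is trimmed to three methods so that the snippet stands on its own.

```csharp
using System.Collections;
using System.Collections.Generic;

public interface INewList<T> : IEnumerable<T>
{
    int Count { get; }
    T this[int index] { get; set; }
    void InsertAt(int index, T item);
    void RemoveAt(int index);
}

// Trimmed copy of the extension class above (Add, IndexOf, Contains only).
public static class NewListExtensions
{
    public static void Add<T>(this INewList<T> lst, T item)
    {
        lst.InsertAt(lst.Count, item);
    }

    public static int IndexOf<T>(this INewList<T> lst, T item)
    {
        var eq = EqualityComparer<T>.Default;
        for (int i = 0; i < lst.Count; i++) {
            if (eq.Equals(item, lst[i]))
                return i;
        }
        return -1;
    }

    public static bool Contains<T>(this INewList<T> lst, T item)
    {
        return lst.IndexOf(item) != -1;
    }
}

// Hypothetical implementation: only the core members are written by hand.
public sealed class SimpleList<T> : INewList<T>
{
    private readonly List<T> items = new List<T>();

    public int Count { get { return items.Count; } }

    public T this[int index]
    {
        get { return items[index]; }
        set { items[index] = value; }
    }

    public void InsertAt(int index, T item) { items.Insert(index, item); }
    public void RemoveAt(int index) { items.RemoveAt(index); }

    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

A caller can then write list.Add("a") or list.Contains("a") against a SimpleList<string> even though the class itself declares neither method.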
It would be even niftier if you could add these methods straight onto INewList<T>, and have the C# compiler emit the extension methods silently for you. In other words:
public interface INewList<T> : IEnumerable<T>
{
... interface methods (as above) ...
void Add(T item)
{
InsertAt(Count, item);
}
void Clear()
{
int count;
while ((count = Count) > 0) {
RemoveAt(count - 1);
}
}
... and so on ...
}
Although this would just be sugar for the NewListExtensions class shown earlier, it sure saves some typing and makes the pattern more apparent and first class.
Though cool, this whole idea is certainly not perfect.
For one, there are no extension properties. So you can't use this trick for properties.
But the more obvious and severe downside to this approach is that these methods are not specialized for the given concrete type. For example, the Clear method is potentially far less efficient than a hand-rolled List<T>, because it does O(N) RemoveAts rather than a single O(1) fixup of the count.
Recall now that the compiler binds more tightly to instance methods than extension methods. So we could implement our own little list class with a faster Clear method if we'd like:
class MyList<T> : INewList<T>
{
... the two properties and two methods from INewList<T> ...
public void Clear()
{
... efficient! ...
}
}
Now when someone calls Clear on a MyList<T> directly, the compiler will bind to the efficient Clear.
This is still not perfect. If you pass the MyList<T> to an API that takes in an INewList<T>, any calls to Clear will fall back to the extension method. Extension methods are not virtual in any way. You can try to simulate virtual dispatch, but it gets messy quick. For example, say we defined an IFasterList<T> that includes all those convenience methods that lists frequently want to make faster; we can then do a typecheck plus virtual dispatch in the extension method.
For now, let's pretend that's just the Clear method:
public interface IFasterList<T> : INewList<T>
{
void Clear();
}
Of course, MyList<T> above would now implement IFasterList<T>. Invocations through IFasterList<T> will automatically bind to the faster variant; but if objects that implement IFasterList<T> get passed around as INewList<T>s, you lose this ability. So the Clear extension method can now do a typecheck:
public static void Clear<T>(this INewList<T> lst)
{
IFasterList<T> fstLst = lst as IFasterList<T>;
if (fstLst != null) {
fstLst.Clear();
return;
}
int count;
while ((count = lst.Count) > 0) {
lst.RemoveAt(count - 1);
}
}
This works but is obviously a tedious and hard-to-maintain solution. It would be neat if someday C# figured out a way to "magically" reconcile virtual dispatch and extension methods. I don't know if there is a clever solution out there. I am skeptical. Nevertheless, despite this flaw, the above techniques are certainly thought-provoking and interesting enough to play around with and consider for your own projects. And at the very least, it's fun. Enjoy.
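The binding rules above are easy to get wrong, so here is a small self-contained sketch that counts which Clear actually runs. The shape of INewList<T> is assumed as before, and FastList<T>, the two counters, and DispatchDemo are hypothetical names added purely for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public interface INewList<T> : IEnumerable<T>
{
    int Count { get; }
    T this[int index] { get; set; }
    void InsertAt(int index, T item);
    void RemoveAt(int index);
}

public interface IFasterList<T> : INewList<T>
{
    void Clear();
}

public static class NewListExtensions
{
    public static int SlowClears;   // trips through the O(N) RemoveAt loop

    public static void Clear<T>(this INewList<T> lst)
    {
        var fast = lst as IFasterList<T>;
        if (fast != null) { fast.Clear(); return; }   // simulated virtual dispatch
        SlowClears++;
        int count;
        while ((count = lst.Count) > 0) lst.RemoveAt(count - 1);
    }
}

// Has an efficient instance Clear, but does not advertise it through an interface.
public class MyList<T> : INewList<T>
{
    private readonly List<T> items = new List<T>();
    public int FastClears;

    public int Count { get { return items.Count; } }
    public T this[int index] { get { return items[index]; } set { items[index] = value; } }
    public void InsertAt(int index, T item) { items.Insert(index, item); }
    public void RemoveAt(int index) { items.RemoveAt(index); }
    public void Clear() { FastClears++; items.Clear(); }
    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

// Same list, but the inherited public Clear now implements IFasterList<T>.Clear.
public sealed class FastList<T> : MyList<T>, IFasterList<T> { }

public static class DispatchDemo
{
    public static string Run()
    {
        NewListExtensions.SlowClears = 0;

        var plain = new MyList<int>();
        plain.InsertAt(0, 1);
        INewList<int> plainAsInterface = plain;
        plainAsInterface.Clear();   // static type is INewList<int>: extension, slow loop

        var fast = new FastList<int>();
        fast.InsertAt(0, 1);
        INewList<int> fastAsInterface = fast;
        fastAsInterface.Clear();    // extension again, but the typecheck finds the fast path

        return plain.FastClears + "," + fast.FastClears + "," + NewListExtensions.SlowClears;
    }
}
```

Calling plain.Clear() directly on the MyList<int> variable would bind to the instance method instead; it is only the trip through the interface type that loses it.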
 Friday, January 08, 2010
Sometimes you need to wait for something before proceeding with a computation.
Perhaps you need to know the value of some integer that is being computed concurrently. Maybe you need to wait for the bytes to flush to disk before telling another process the file is consistent and ready to read. Or you need to get that next row back from the database before painting it on the UI. It could be that you need to wait for the missile to leave the bay before closing the bay door. And so on.
And sometimes there’s simply nothing better to do while waiting for these things to happen other than to let the CPU halt (or let other processes on the machine run). You need to twiddle your thumbs a bit, and exhibit a little patience. Or at least your program does. This is simply an unfortunate fact of life.
This manifests numerous ways in our programming models:
1) Waiting on an event.
2) Waiting to acquire an already-held lock.
3) Finding that the GUI message queue is empty and doing a MsgWaitForMultipleObjectsEx.
4) Finding that the COM RPC queue is empty and doing a CoWaitForMultipleHandles.
5) Issuing an Ada rendezvous ‘accept’ and finding that no messages await you, thus blocking.
6) Issuing an Erlang ‘receive’ and finding that no messages await you, thus blocking.
7) Waiting on a .NET 4.0 task.
8) Issuing a ContinueWith on a .NET 4.0 task.
9) And so on.
There are three big distinctions to make about the characteristic nature of this waiting: namely, (1) what condition's establishment is being sought -- i.e. the reason for the wait, (2) whether multiple such conditions of interest may be waited on simultaneously, and, related, (3) whether waiting for said condition(s) necessarily means that the processing of some other conditions that may arise elsewhere, but require the blocked context to run, cannot occur.
I will be the first to admit that this statement is rather abstract. But it really does matter.
For example, MsgWaitForMultipleObjectsEx is a pumping wait. Not only do you wait for the occurrence of one of several events to get set, but the arrival of a new top-level message at the message queue (either GUI or COM RPC-related) causes immediate processing of that message, presuming the thread is blocked at that call at the time. Although you can be deeply nested in some complicated code, you “jump” to the event loop to run the message handling code. Vanilla WaitForMultipleObjectsEx works in a similar way vis-à-vis APCs, provided the wait is alertable. This is quite different from a fully blocking non-pumping wait, which only waits for one or more very specific events, but does not dispatch messages simultaneously.
Win32 esoterica aside, the concepts appear elsewhere. The moral equivalent in Ada or Erlang is to do a selective-accept or -receive, intentionally not dispatching certain messages that might arrive in the meantime. (To be fair, you can also do this in COM with message filters.) This often happens when you nest accepts and receives. You may be capable of processing messages A-Z at the top-level tail recursive loop; but if that nested accept only knows about message kinds M and N, then there are 24 other kinds that will not be picked up in the meantime.
Not pumping for messages is dangerous. And it can lead to deadlock if you pump for the wrong ones. Like if you’re accepting M or N, yet the triggering of M or N depends on first processing some message K waiting in the queue. COM RPCs with cycles run face first into this. And/or not pumping can lead to responsiveness and scalability problems. Perhaps M or N eventually does arrive, yet little old K needs to wait an indeterminate amount of time before it is seen. Whereas we could have overlapped its processing. This is why most STAs pump while waiting, and, similarly, why many Erlang processes consist of a main loop that is prepared to handle any message the process accepts at that top level loop. They may seem very different but they are strikingly not.
Yet paradoxically pumping for messages is also dangerous. You must predict all the kinds of messages that may reentrantly get executed, and your state at the point of the blocking call must be consistent enough to tolerate them. (At least those that will actually happen.) In COM STAs, this can be wholly unpredictable and indeed because the CLR auto-pumps on STAs the blocking points can be hidden. Overly aggressively admitting messages may seem like the right thing to do, until you’ve wedged yourself into some unforeseen inconsistent state. You can avoid this by making each message handler atomic; see Argus. But if you can't or don't have the discipline to do that, or aren't quite sure, you must not pump. You either avoid pumping altogether or you selectively pump messages that do not touch the state encapsulated by the pump. Or you lock access to state with a non-recursive lock and run the risk of deadlock.
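The reentrancy hazard is easier to see in miniature. The following is a sketch only: MessagePump is a hypothetical single-threaded queue rather than any Win32 or COM API, and Account, Audit, and the "ack" message are invented for illustration. It shows a handler that pumps while its invariant is broken and an unrelated message that gets dispatched into the gap.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-threaded message loop; stands in for an STA or event loop.
public sealed class MessagePump
{
    private readonly Queue<Action> queue = new Queue<Action>();

    public void Post(Action message) { queue.Enqueue(message); }

    // A pumping wait: keep dispatching queued messages until the condition holds.
    public bool PumpUntil(Func<bool> condition)
    {
        while (!condition())
        {
            if (queue.Count == 0) return false;   // nothing left that could satisfy us
            queue.Dequeue()();                    // runs reentrantly, inside the waiter
        }
        return true;
    }
}

public sealed class Account
{
    public int Checking = 100, Savings = 0;
    public bool SawBrokenInvariant;

    private readonly MessagePump pump;
    private bool transferAcked;

    public Account(MessagePump pump) { this.pump = pump; }

    // Invariant between messages: Checking + Savings == 100.
    public void Transfer(int amount)
    {
        Checking -= amount;                        // invariant is broken here...
        pump.Post(() => transferAcked = true);     // stand-in for a reply from elsewhere
        pump.PumpUntil(() => transferAcked);       // ...and we pump while it is broken
        Savings += amount;                         // invariant restored
    }

    public void Audit()
    {
        if (Checking + Savings != 100) SawBrokenInvariant = true;
    }
}

public static class PumpDemo
{
    public static bool Run()
    {
        var pump = new MessagePump();
        var account = new Account(pump);
        pump.Post(account.Audit);   // an unrelated message already waiting in the queue
        account.Transfer(30);       // Audit is dispatched in the middle of Transfer
        return account.SawBrokenInvariant;
    }
}
```

Making Transfer atomic, or pumping only for the acknowledgement, would hide the broken state from Audit. The second option is exactly the selective wait that risks deadlock when the acknowledgement depends on a message you refused to dispatch.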
I have found it clarifying to think about blocking in event loop concurrency and state machine terms, advancing from one state to the next in between waits. It’s a slippery model, but particularly when working in message passing systems that employ event loops, it can help to identify all the familiar problems with shared memory, blocking, and consistency.
Indeed it is interesting how blocking and non-blocking systems can rapidly approach each other. Starting from either extreme tempts you to tiptoe closer and closer to the middle. The familiarity of the other extreme tempts you. Until, alas, you just might meet in the middle.
 Sunday, January 03, 2010
Rewind the clock to mid-2004. Around this time awareness about the looming “concurrency sea change” was rapidly growing industry-wide, and indeed within Microsoft an increasing number of people – myself included – began to focus on how the .NET Framework, CLR, and Visual C++ environments could better accommodate first class concurrent programming. Of course, our colleagues in MSR and researchers in the industry more generally had many years’ head start on us, in some cases dating back 3+ decades. It is safe to say that there was no shortage of prior art to understand and learn from.
One piece of prior art was particularly influential on our thoughts: software transactional memory. (STM, or, in short, just TM.) In fact, right around that time, Tim Harris’s TM work grew in prominence (my first exposure arriving by way of OOPSLA’03’s proceedings, which contained the “Language Support for Lightweight Transactions” paper). TM was immediately fascinating, and simultaneously promising. For a number of reasons:
- TM hid sophisticated synchronization mechanisms under a simple veil.
- It could be implemented using sophisticated (and scalable) techniques, again under a simple veil.
- It built on decades of experience in building scalable and parallel transactional databases.
- Among others. But most of all, it was a bright shiny light in a sea of complexity.
- And how fortunate: Tim was a colleague in our neighboring MSR Cambridge offices (and still is).
In a nutshell, TM offered declarative concurrency-safety. You declare what you’d like in as few simple words as possible, and you get what you want. In this case, those simple words are ‘atomic { S; }’.
Many people latched onto TM rapidly and simultaneously, both inside and outside of Microsoft. I hacked together a little prototype built atop SSCLI (“Rotor”), and another architect on our team built an even more feature-rich prototype using MSIL rewriting. We compared notes, began jointly exploring the design space, and talking more regularly with other colleagues like Tim in MSR. Soon thereafter we kicked off a small working group with about a dozen architects and researchers from around the company, aiming to articulate what a real productized TM might look like. Fun times.
We were eventually given the OK for an official “incubation” project, and multiple years of exploration and hard work ensued. In fact, the fruits of many people’s labor recently got released in the form of a Community Technology Preview -- a good conduit for experimentation, but with no commitment to add it to any of Microsoft’s products. To be clear, I had only a small part to play in this ambitious project, and mostly towards the start. Partway through, I stepped away to do PLINQ and Parallel Extensions to .NET, both of which are now part of the .NET Framework 4.0. Dozens of amazing people played a significant role in the project over the years. But I am getting way ahead of myself…
I’ve been away from the nitty-gritty day-to-day details of TM for about 3 years now, which feels sufficiently long to develop a healthy perspective on the project. So here it is. What follows is of course in no way Microsoft’s “official position” on the technology, but rather my own personal one. I’ve interspersed generalizations with specific details because that’s just how my brain thinks about TM.
Towards the North Star
A wondrous property of concurrent programming is the sheer number and diversity of programming models developed over the years. Actors, message-passing, data parallel, auto-vectorization, ...; the titles roll off the tongue, and yet none dominates and pervades. In fact, concurrent programming is a multi-dimensional space with a vast number of worthy points along its many axes.
This rich history is simultaneously a blessing and a daunting curse. But in any case it can make for some very interesting multi-year-long immersion. My UW talk from 1 1/2 years ago just barely touches on the sheer breadth.
TM’s greatest virtue is the first word in its name: transactional. It turns out that, no matter your concurrent programming model du jour, three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions). We use locks in shared-memory programming, coarse grained messages in message-passing, and functional programming to achieve all of these things in different ways. Transactions are another such mechanism, sure, but more than that, transactions are an all-encompassing way of thinking about how programs behave at their most fundamental core. Transaction is a religion.
Not everybody believes this, and of course why would they: it is an immensely subjective and qualitative statement. Some will claim that models like message passing entirely avoid the likes of “race conditions,” and such, but this is clearly false: state transitions are made, complicated state invariants are erected amongst a sea of smaller isolated states, and care must be taken, just as in shared memory. Even Argus, a beautiful early incarnation of message-passing (via promises) demands that messages are atomic in nature. This property is not checked and, if done improperly, leads to “races in the large.” Even Argus introduced the notion of transactions and persistence in the form of guardians.
Of course, message passing helps push you in the right direction. It is not, however, a panacea.
I was reading my ICFP proceedings recently and was reminded of research done in the context of Erlang that supports this assertion. In it, they apply CHESS-like techniques (with clever search space culling) to find race conditions. Indeed we use similar techniques very successfully for our message-passing programming models on my team here at Microsoft.
Transactions are terrific because they are “automatic”. You declare the boundaries, and the transactional machinery takes care of the rest. This is true of databases and also TM. Countless developers in the wild write massively concurrent programs by issuing operations against databases: they can do this so easily because they grok the simple façade that transactions provide. Numerous server-side state-based applications use transactions to shield programmers from the pitfalls of concurrency. Behold MSDTC. The bet we were making is that similar models would scale down just as well “in the small”.
The canonical syntactic footprint of TM is also beautiful and simple. You say:
atomic {
… concurrency-safe code goes here …
}
And everything in that block is magically concurrency-safe. (Well, you still need to ensure the consistency part, but isolation and atomicity are built-in. Mix this with Eiffel- or Spec#-style contracts and assertions like those in .NET 4.0, run at the end of each transaction, and you’re well on your way to verified consistency also. The ‘check E’ work in Haskell was right along these lines.) You can read and write memory locations, call other methods, all without worrying about whether concurrency-safety will be at risk.
For example, consider three transactions running concurrently:
int x = 0, y = 0, z = 0;
atomic { x++; } // transaction 1
atomic { y++; x++; } // transaction 2
atomic { z++; y++; x++; } // transaction 3
No matter the order in which these run, the end result will be x == 3, y == 2, z == 1.
Contrast this elegant simplicity with the many pitfalls of locks:
- Data races. Like forgetting to hold a lock when accessing a certain piece of data. And other flavors of data races, such as holding the wrong lock when accessing a certain piece of data. With TM, not only do these issues not exist, but the solution is not to add countless annotations associating locks with the data they protect; instead, you declare the scope of atomicity, and the rest is automatic.
- Reentrancy. Locks don’t compose. Reentrancy and true recursive acquires are blurred together. If a locked region expects reentrancy, usually due to planned recursion, life is good; if it doesn’t, life is bad. This often manifests as virtual calls that reenter the calling subsystem while invariants remain broken due to a partial state transition. At that point, you’re hosed.
- Performance. The tension between fine-grained locking (better scalability) versus coarse-grained locking (simplicity and superior performance due to fewer lock acquire/release calls) is ever-present. This tension tugs on the cords of correctness, because if a lock is not held for long enough, other threads may be able to access data while invariants are still broken. Scalability pulls you to engage in a delicate tip-toe right up to the edge of the cliff.
- Deadlocks. This one needs no explanation.
In a nutshell, locks are not declarative. Not even close. They are not associated with the data protected by those locks, but rather the code that accesses said data. (For example: in the above code snippet, do we need three locks? Or one? Or …? Imagine we choose three: one for each variable, x, y, and z. What if we increment z and release its associated lock, so that some other thread can now see the newly incremented z before y and x get incremented? Whether this is acceptable depends on the program.) Sure, you can achieve atomicity and isolation, but only by reasoning intimately about your code and understanding the way the locks are used. And if you care about performance, you are also going to need to think about hardware esoterica such as CMPXCHG, spin waiting, cache contention, optimistic techniques with version counters and memory models, ABA, and so on.
The contrast is stark. Atomic-block-style transactions provide automatic serializability of whole regions of code, no matter what that code does, and the TM infrastructure does the rest, choosing between: optimistic, pessimistic, coarse, fine, etc. The linearization point of a transaction is clear: the end of the atomic block. TM can even adjust strategies based on the surrounding environment: hardware, dynamic program behavior, etc. (“Policy”.) In comparison to locks, TM is an order of magnitude simpler. There have even been studies whose conclusions support this assertion.
(Transactions unfortunately do not address one other issue, which turns out to be the most fundamental of all: sharing. Indeed, TM is insufficient – even dangerous – on its own, because it makes it very easy to share data and access it from multiple threads; first class isolation is far more important to achieve than faux isolation. This is perhaps one major difference between client and server transactions. Most server sessions are naturally isolated, and so transactions only interfere around the edges. I’m showing my cards too early, and will return to this point much, much later in this essay.)
TM also has the attractive quality of automatic rollback of partial state updates. (How did I get this far without discussing rollback?) Concurrency aside, this avoids needing to write backout code to run in the face of unhandled exceptions. In retrospect this capability alone is almost enough to justify TM in limited quantities. Reams of code “out there” contain brittle, untested, and, therefore, incorrect error handling code. We have seen such code lead to problems ranging in severity: reliability issues leading to data loss, security exploits, etc. Were we to replace all those try/catch/rethrow blocks of code with transactions, we could do away with this error prone spaghetti. We’d also eliminate try/filter exploits thanks to Windows/Win32 2-pass SEH. Sometimes I wish we focused on this simple step forward, forgot about concurrency-safety, and baby stepped our way forward. Likely it wouldn’t have been enough, but I still wonder to this day.
We also toyed with the ability to replace reliability-oriented CER blocks with transactions. As you go through a transaction, there is a log of forward progress and how to undo it. So no matter the kind of failure, including OOM, you can rollback the partial state updates with zero allocation required.
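The undo-log idea can be sketched in a few lines. This is an illustration under loud assumptions: UndoLog, Atomically, and RollbackDemo are hypothetical names, the sketch is single-threaded (it shows rollback only, not isolation), and unlike what a CER replacement would need, it allocates closures on the forward path rather than preallocating the log.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical undo log: every write records how to put the old value back.
public sealed class UndoLog
{
    private readonly Stack<Action> undo = new Stack<Action>();

    public void Write(Func<int> get, Action<int> set, int newValue)
    {
        int old = get();
        undo.Push(() => set(old));   // remember how to undo this write
        set(newValue);               // then update in place
    }

    // Replays the log backwards so the most recent write is undone first.
    public void Rollback()
    {
        while (undo.Count > 0) undo.Pop()();
    }
}

public static class Atomically
{
    // Runs the body; an unhandled exception rolls back every logged write
    // and then continues propagating to the caller.
    public static void Run(Action<UndoLog> body)
    {
        var log = new UndoLog();
        try { body(log); }
        catch { log.Rollback(); throw; }
    }
}

public static class RollbackDemo
{
    static int x;

    public static int Run(bool p)
    {
        x = 0;
        try
        {
            Atomically.Run(log =>
            {
                log.Write(() => x, v => x = v, x + 1);   // x++
                if (p) throw new Exception("Whoops");
            });
        }
        catch (Exception)
        {
            // The exception still escapes the block; only the state is restored.
        }
        return x;
    }
}
```

With p false the increment sticks; with p true the caller sees the exception but x is back to its value from before the block, with no hand-written backout code.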
At some point we began describing an ‘atomic’ block as though the program used a single global lock for all its concurrency operations. This would be grossly inefficient, of course, and fails to capture the precise isolation and rollback properties, but nevertheless conveys the basic idea. It also, as an aside, foreshadows a few of the difficult problems that lie ahead, namely strong vs. weak atomicity. Even though there is only one, if you forget to hold this one global lock while accessing shared data, you’ve still got a data race on your hands. This model won’t save you. We will return to this later on.
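As a sketch of that mental model, and only the mental model, here are the three transactions from earlier written against one global lock. GlobalLockModel and its Atomic helper are hypothetical; the sketch offers no rollback, and nothing stops other code from touching x, y, or z without taking the lock.

```csharp
using System;
using System.Threading;

public static class GlobalLockModel
{
    private static readonly object GlobalLock = new object();
    static int x, y, z;

    // Stand-in for `atomic { ... }`: every block takes the same lock.
    static void Atomic(Action body)
    {
        lock (GlobalLock) { body(); }
    }

    public static string Run()
    {
        x = y = z = 0;
        var threads = new[]
        {
            new Thread(() => Atomic(() => { x++; })),
            new Thread(() => Atomic(() => { y++; x++; })),
            new Thread(() => Atomic(() => { z++; y++; x++; })),
        };
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        return x + "," + y + "," + z;
    }
}
```

Whatever order the threads run in, the blocks serialize, so the result is x == 3, y == 2, z == 1.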
Tough Decisions: Life as a Starving Artist
We faced some programming model decisions requiring artistic license early on.
One that we quickly decided was whether to automatically roll back a transaction in response to an unhandled exception thrown from within. Such as with this code:
atomic {
x++;
if (p)
throw new Exception("Whoops");
}
If p evaluates to true, and hence an unhandled exception is thrown, should that x++ be rolled back?
Most on the team said “Yes” as a gut reaction, whereas some argued we should require the programmer to catch-and-rollback by hand. We settled on the automatic approach because it seemed to do what you would expect in all the cases we looked at. Your transaction failed to complete normally and consistently. We also debated whether to support a unilateral “Transaction.Abort()” capability; while we agreed a “Transaction.Commit()” would be silly – the only way to commit a transaction being to reach its end non-exceptionally – the jury remained split on unilateral abort. We eventually found that, particularly when nesting is involved, the ability to detect a dire problem with the universe and bail unilaterally can be useful.
And we also hit some tough snags early on. Some were trivial, like what happens when an exception is thrown out of an atomic block. Of course that exception was likely constructed within the atomic block (‘throw new SomeException()’ being the most common form of ‘throw’), so we decided we probably need to smuggle at least some of that exception’s state out of the transaction. Like its stack trace. And perhaps its message string. I wrote the initial incarnation of the CLR exception subsystem support, and stopped at shallow serialization across the boundary. But this was a slippery slope, and eventually the general problem was seen, leading to more generalized nesting models (which I shall describe briefly below). Another snag, which was quite non-trivial, was the impact to the debugging experience. Depending on various implementation choices – like in-place versus buffered writes – you may need to teach the debugger about TM intrinsically. And some of the challenges were fundamental to building a first class TM implementation. Clearly the GC needed to know about TM and its logging, because it needs to keep both the “before” and “after” state of the transaction alive, in case it needed to roll back. The JIT compiler was very quickly splayed open and on the surgery table. And so on.
Throughout, it became abundantly clear that TM, much like generics, was a systemic and platform-wide technology shift. It didn’t require type theory, but the road ahead sure wasn’t going to be easy.
So we knocked down many early snags, and kept plowing forward, eagerly and excitedly. None of these challenges were insurmountable. We remained hopeful and happy (perhaps even blissful) to continue exploring the space of possible solutions. More irksome snags lurked right around the corner, however. And little did we know that some decisions we were about to make would subject us to some of the biggest such snags. TM’s greatest feature – slap an atomic around a block of code and it just gets better – would turn out to be its greatest challenge… but alas, I am again jumping ahead; more on that later.
Turtles, but How Far Down? Or, Bounded vs. Unbounded Transactions
Not all transactions are equal. There is a broad spectrum of TMs, ranging from those that are bounded to updating, say, 4 memory locations or cache lines, to those that are entirely unbounded. Indeed TM blurs together with other hardware-accelerated synchronization techniques, like speculative lock elision (SLE). The more constrained TM models are often hardware-hybrids, and the limitations imposed are typically due to physical hardware constraints. Models can be pulled along other axes, however, such as whether memory locations must be tagged in order to be used in a transaction or not, etc. Haskell requires this tagging (via TVars) so that side-effects are evident in the type system as with any other kind of monad.
We quickly settled on unbounded transactions. Everything else looked like multi-word CAS and, although we knew multi-word CAS would be immensely useful for developing new lock-free algorithms, our aim was to build something radically new and with broader appeal. If we ended up with a hardware-hybrid, we would expect the software to pick up the slack; you’d get nice acceleration within the hardware constraints, and then “fall off the silent cliff” to software emulation thereafter. Thus the unbounded approach was chosen.
In hindsight, this was a critical decision that had far-reaching implications. And to be honest, I now frequently doubt that it was the right call. We had our hearts in the right places, and the entire industry was trekking down the same path at the same time (with the notable exception of Haskell). But with the wisdom of age and hindsight, I do believe limited forms of TM could be wildly successful at particular tasks and yet would have avoided many of the biggest challenges with unbounded TM.
And believe me: many such challenges arose in the ensuing months.
An example of one challenge that didn’t threaten the model of TM per se, but sure did make our lives more difficult, is the compilation strategy we were forced to adopt. Transactions cost something. To transact a read or write entails a non-trivial amount of extra work; we spent a lot of time optimizing away redundant work, and developing new optimizations that reduced the overhead of TM. But at the end of the day, the cost is not zero – and in fact, the common case is far from it. Imagine you have an unbounded transaction model and are faced with compiling a particular method from MSIL to native code. A simple separate-module-based compiler (i.e. not whole-program) will not necessarily know whether this method will get called from a transaction, or from non-transactional code, such that in the worst case the method must be prepared for transactional access. There are a variety of techniques to use to produce code that supports both: the two extremes are (1) cloning, or (2) sharing with conditional dynamic checking. Neither extreme is particularly attractive, and this choice represents a classic space-time tradeoff that entails finding a reasonable middle ground. A JIT compiler can dynamically produce the version that is needed at the moment, but offline compilers – like the CLR’s NGEN – do not have this luxury. And within Microsoft at least, and among shrink-wrap ISVs, offline compilation is of greater importance than JIT compilation. For better or for worse.
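To picture the two extremes, here is what each might look like if you wrote the compiler's output by hand. Everything here is hypothetical: Tx, its thread-static Current, LogWrite, and the _Tx naming convention are stand-ins for illustration, not the shape of any real implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical transaction record; LogWrite stands in for real TM bookkeeping.
public sealed class Tx
{
    [ThreadStatic] public static Tx Current;   // null outside a transaction
    public readonly List<object> WriteLog = new List<object>();
    public void LogWrite(object target) { WriteLog.Add(target); }
}

public sealed class Counter { public int Value; }

public static class CompiledForms
{
    // (1) Cloning: two bodies per method; each call site picks one statically.
    //     Costs code size, but non-transactional callers pay nothing.
    public static void Increment(Counter c) { c.Value++; }
    public static void Increment_Tx(Tx tx, Counter c) { tx.LogWrite(c); c.Value++; }

    // (2) Sharing: one body, but every call pays for a dynamic check.
    public static void IncrementShared(Counter c)
    {
        Tx tx = Tx.Current;
        if (tx != null) tx.LogWrite(c);
        c.Value++;
    }
}

public static class CloneDemo
{
    // Returns how many writes were logged by the transactional calls.
    public static int Run()
    {
        var c = new Counter();
        CompiledForms.Increment(c);         // plain clone: no overhead at all
        CompiledForms.IncrementShared(c);   // no transaction: check fails, nothing logged

        var tx = new Tx();
        Tx.Current = tx;
        try
        {
            CompiledForms.Increment_Tx(tx, c);   // transactional clone logs the write
            CompiledForms.IncrementShared(c);    // shared body logs after its check
        }
        finally
        {
            Tx.Current = null;
        }
        return tx.WriteLog.Count;
    }
}
```

Cloning doubles the generated code for anything reachable from a transaction; sharing keeps one copy but taxes even callers that never use transactions.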
The model of unbounded transactions is the hard part. You surround any arbitrary code with ‘atomic { … }’ and everything just works. It sounds beautiful. But just think about what can happen within a transaction: memory operations, calls to other methods, P/Invokes to Win32, COM Interop, allocation of finalizable objects, I/O calls (disk, network, databases, console, …), GUI operations, lock acquisitions, CAS operations, …, the list goes on and on. Versus bounded transactions, where we could say something like: if you do more than N things, the transaction will fail to run – deterministically.
Unbounded really was the golden nugget. But we should not be shy about what this decision implies.
Implementing the Idea
This leads me to a brief tangent on implementation. Given that we didn’t implement TM with a single global lock, as the naïve mental model above suggests, you may wonder how we actually did do it. Three main approaches were seriously considered:
- IL rewriting. Use a tool that passes over the IL post-compilation to inject transaction calls.
- Hook the (JIT) compiler. The runtime itself would intrinsically know how to inject such calls.
- Library-based. All transactional operations would be explicit calls into the TM infrastructure.
Approaches #1 and #2 would look similar, but #3 would be quite different. Instead of:
atomic {
x++;
}
Or:
Atomic.Run(() => {
x++;
});
You might say something like:
Atomic.Run(() => {
Atomic.Write(ref x, Atomic.Read(ref x) + 1);
});
With enough language work, we could have tried to desugar the former into the latter, but when you start crossing method boundaries, everything gets more complicated. (Do you create transactional clones of every method, and rewrite calls from ordinary methods to the transactional clone? This is easy to do with a rewriter or compiler, but quite difficult with a pure library approach.) We also knew we’d need to do some very sophisticated compiler optimizations to get TM’s performance to an acceptable level. So we chose approach #2 for our “real” prototype, and never looked back.
After this architectural approach was decided, a vast array of interesting implementation choices remained.
We moved on to building the primitive library with all the TM APIs that the JIT would introduce calls into. We quickly settled on an approach much like Harris’s (and, at the time, pretty much the industry/research standard): optimistic reads, in-place pessimistic writes, and automatic retry. That means reads do not acquire locks of any sort, and instead, once the end of the transaction has been reached, all reads are validated; if any locations read have been modified concurrently (or an uncommitted value was read), the whole transaction is thrown away and reattempted from the start. Writes work like locks. This approach makes reads cheap: a single read consists of reading the value, and a version number whose address is at a statically known offset. No interlockeds. This is great since reads typically far outnumber writes. Down the line, we explored adding more sophisticated policy than this, which I will detail in brief below.
So the compiler would inject hooks for the above code like so:
while (true) {
TX tx = new TX();
try {
// x++;
tx.OpenReadOptimistic(ref x);
int tmp = x;
tx.OpenWritePessimistic(ref x);
x = (tmp + 1);
if (!tx.Validate())
continue; // a read was invalidated: retry from the top
tx.Commit();
break; // committed: leave the retry loop
}
catch {
tx.Rollback();
throw;
}
}
Notice there are some obvious overheads in here:
- The atomic block becomes a loop (to support automated retry).
- A new transaction must be allocated and likely placed in TLS (if methods are called).
- A try/catch block is used to initiate rollback on unhandled exceptions.
- Each unique location read in a block requires at least one call to OpenReadOptimistic.
- Each unique location written requires at least one call to OpenWritePessimistic.
- Each location read must be validated (at Validate), and finally the transaction is committed (at Commit).
Much of the work in the compiler was meant to reduce these overheads. For example, if the same location is read multiple times, there’s no need to call OpenReadOptimistic more than once. If the compiler can statically detect this, it may elide some of the calls. If the same location is read and then written – as in the above example – only the write lock must be acquired. If no methods are called, the transaction object can be enregistered, and we needn’t add it to TLS so long as the exception trap code knows how to move it from register to TLS on demand. Et cetera.
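For a feel of how optimistic reads, validation, and automatic retry fit together, here is a deliberately simplified sketch. It is not the design described above: to stay short and obviously correct it swaps in buffered writes and a single global commit lock in place of in-place pessimistic writes with per-location locks. TCell, Tx, and StmDemo are hypothetical names, and the sketch does not protect a transaction body from seeing an inconsistent mix of values before validation runs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// One transactional location. Value and version travel together in an
// immutable snapshot, so a reader always sees a matching pair.
public sealed class TCell
{
    internal sealed class Snapshot
    {
        public readonly int Value;
        public readonly long Version;
        public Snapshot(int value, long version) { Value = value; Version = version; }
    }

    internal volatile Snapshot Current;

    public TCell(int initial) { Current = new Snapshot(initial, 0); }
    public int Peek() { return Current.Value; }   // non-transactional read
}

public sealed class Tx
{
    private static readonly object CommitLock = new object();
    private readonly Dictionary<TCell, TCell.Snapshot> reads =
        new Dictionary<TCell, TCell.Snapshot>();
    private readonly Dictionary<TCell, int> writes = new Dictionary<TCell, int>();

    public int Read(TCell cell)
    {
        int buffered;
        if (writes.TryGetValue(cell, out buffered)) return buffered;   // read own write
        TCell.Snapshot seen;
        if (!reads.TryGetValue(cell, out seen))
        {
            seen = cell.Current;   // optimistic: no lock is taken to read
            reads[cell] = seen;    // remember the version we saw
        }
        return seen.Value;
    }

    public void Write(TCell cell, int value) { writes[cell] = value; }

    // Validates every read, then publishes the buffered writes.
    private bool TryCommit()
    {
        lock (CommitLock)
        {
            foreach (var pair in reads)
            {
                if (pair.Key.Current.Version != pair.Value.Version) return false;
            }
            foreach (var pair in writes)
            {
                long next = pair.Key.Current.Version + 1;
                pair.Key.Current = new TCell.Snapshot(pair.Value, next);
            }
            return true;
        }
    }

    public static void Atomic(Action<Tx> body)
    {
        while (true)
        {
            var tx = new Tx();
            body(tx);                     // an exception just discards the buffers
            if (tx.TryCommit()) return;   // conflict: throw the attempt away and retry
        }
    }
}

public static class StmDemo
{
    public static int Run()
    {
        var x = new TCell(0);
        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < 1000; n++)
                    Tx.Atomic(tx => tx.Write(x, tx.Read(x) + 1));   // x++
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return x.Peek();
    }
}
```

Even in this toy form the overheads listed above are visible: a retry loop, a transaction object per attempt, bookkeeping on every read and write, and a validation pass at the end.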
There are other overheads that are not so obvious. Optimistic reads mandate that there is a version number for each location somewhere, and pessimistic writes mandate that there is a lock for each location somewhere.
A straightforward technique is to use a hashing scheme to associate locations with this auxiliary data: each address is hashed to index into a table of version numbers and locks. This leads to false sharing, of course, but reduces space overhead and makes lookup fast. Unfortunately, in a garbage collected environment, addresses are not stable and therefore hashing becomes complicated. You can use object hash codes for this purpose, but .NET hash codes are overridable; and generating them is not nearly as cheap as using the memory location’s address, which by definition is already in-hand. Other alternatives of course exist. You can associate version numbers and locks with the objects themselves, just like monitors and object headers/sync-blocks in the CLR: this provides object-granularity locking. Ahh, the age old tension of fine vs. coarse grained locking comes up again.
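The hashing scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's implementation: `StripedVersionTable` and its members are hypothetical names, and because managed code cannot hash a stable address, the sketch substitutes `RuntimeHelpers.GetHashCode`, which returns an identity hash that ignores any `GetHashCode` override.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical sketch: a fixed-size table of version counters. Every object
// whose identity hash maps to the same slot shares one counter, which is the
// false sharing the text mentions.
public sealed class StripedVersionTable
{
    private readonly long[] _versions;
    private readonly int _mask;

    public StripedVersionTable(int slotCount)
    {
        // A power-of-two size lets us index with a mask instead of a modulo.
        if (slotCount <= 0 || (slotCount & (slotCount - 1)) != 0)
            throw new ArgumentException("Slot count must be a power of two.");
        _versions = new long[slotCount];
        _mask = slotCount - 1;
    }

    // Identity hash: stable for the object's lifetime and not overridable.
    public int SlotFor(object o) => RuntimeHelpers.GetHashCode(o) & _mask;

    // What an optimistic reader would record in its read log.
    public long ReadVersion(object o) => Volatile.Read(ref _versions[SlotFor(o)]);

    // What a committing writer would do after publishing its update.
    public void BumpVersion(object o) => Interlocked.Increment(ref _versions[SlotFor(o)]);
}
```

A smaller table saves space but makes unrelated objects collide more often, so a write to one can invalidate a read of the other.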
We eventually realized we’d want both optimistic and pessimistic reads, the latter of which worked a lot like reader/writer locks. We crammed all these into a clever little word-sized data structure which worked a lot like Vista’s SRWL data structure. Except that it also contained a version number.
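To make the idea of a single word holding both a lock and a version concrete, here is a deliberately simplified sketch. The bit layout and the `VersionedLock` name are assumptions made for illustration; it omits the reader counts needed for pessimistic reads, and it ignores the case where the validating transaction itself holds the write lock.

```csharp
using System.Threading;

// Hypothetical layout: bit 0 = write lock held, bits 1..63 = version number.
public sealed class VersionedLock
{
    private long _word; // starts unlocked at version 0

    public static bool IsLocked(long word) => (word & 1L) != 0;
    public static long VersionOf(long word) => word >> 1;

    // Optimistic read: sample the word without acquiring anything.
    public long Sample() => Volatile.Read(ref _word);

    // Pessimistic write: one compare-and-swap sets the lock bit.
    public bool TryAcquireWrite()
    {
        long seen = Volatile.Read(ref _word);
        if (IsLocked(seen)) return false;
        return Interlocked.CompareExchange(ref _word, seen | 1L, seen) == seen;
    }

    // Commit path: clear the lock bit and advance the version in one store.
    // Only the lock holder may call this.
    public void ReleaseWriteAndBump()
    {
        long seen = Volatile.Read(ref _word);
        Volatile.Write(ref _word, (VersionOf(seen) + 1) << 1);
    }

    // Validation: the read is good only if it saw an unlocked word and the
    // word has not changed since.
    public bool Validate(long sampled) =>
        !IsLocked(sampled) && Volatile.Read(ref _word) == sampled;
}
```

Packing both into one word means an optimistic reader needs a single plain load to learn both "is someone writing?" and "which version did I see?".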
It was always surprising to me what strange things in the runtime we bumped up against. We realized a nice GC optimization: instead of keeping strong references to all intermediary states in a transaction log, we could keep weak references to all but the “before” and “after” state. This is important when transacting synthetic situations like this:
static BigHonkinFoo s_f;
…
atomic {
for (int i = 0; i < 1000000; i++)
s_f = new BigHonkinFoo();
}
Of course you wouldn’t write that code exactly. But there’s no need to keep alive all but the s_f that existed prior to entering the atomic block and the current one at any given time. But this leads to particularly hairy finalization issues. If a finalizable object is allocated within a transaction (say BigHonkinFoo), and is then reclaimed, its Finalize() method will be scheduled to run on a separate thread. Yet the transaction log may contain references to it. Thus there is a race between the transaction’s final outcome and the invocation of the finalizer. We came up with a clever solution for this, but there were countless other clever solutions for various things not worth diving too deep into.
Hacking is fun. However, it was not going to be what made or broke TM as a model.
Disillusionment Part I: the I/O Problem
It wasn’t long before we realized another sizeable, and more fundamental, challenge with unbounded transactions. Finalizers touched on this. What do we do with atomic blocks that do not simply consist of pure memory reads and writes? (In other words, the majority of blocks of code written today.) This was not just a pesky question of how to compile a piece of code, but rather struck right at the heart of the TM model.
You already saw the OpenReadOptimistic, OpenWritePessimistic, Validate, Commit, and Rollback pseudo-TM infrastructure calls, each of which operated on memory locations. But what about a read or write from a single block or entire file on the filesystem? Or output to the console? Or an entry in the Event Log? What about a web service call across the LAN? Allocation of some native memory? And so on. Ordinarily these kinds of operations will be composed with other memory operations, with some interesting invariant relationship holding between the disparate states. A transaction comprised of a mixture still ought to remain atomic and isolated.
The answer seemed clear. At least in theory. The transaction literature, including Reuter and Gray’s classic, had a simple answer for such things: on-commit and on-rollback actions, to perform or compensate the logical action being performed, respectively. (Around this same time, the Haskell folks were creating just this capability in their STM, where ‘onCommit’ and ‘onRollback’ would take arbitrary lambdas to execute the said operation at the proper time.) Because we were working primarily in .NET – with a side project targeting C++ -- we decided to use the new System.Transactions technology in 2.0 to hook into inherently transactional resources, like transacted NTFS, registry, and, of course, databases.
(Digging through my blog, I found this article written back in June 2006 about building a volatile resource manager for memory allocation/free operations, just as an example.)
This worked, though we were quite obviously swimming upstream. Numerous challenges confronted us.
A significant problem was that not all operations are inherently transactional, so in many cases we were faced with the need to add faux transactions on top of existing non-transactional services. (Already-transactional services were easy, like databases. Except that mixing fine-grain TM transactions with distributed DTC transactions makes my skin crawl.) For example, how would you undo a write to the console? Well, you can’t, really. So we decided maybe the right default for Console.WriteLine was to use an on-commit action to perform the actual write only once the transaction had committed.
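The on-commit and on-rollback idea can be sketched in a few lines. This is an illustrative model only: `ActionLog` is a hypothetical name, and it is unrelated to the actual System.Transactions enlistment API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of deferred (on-commit) and compensating (on-rollback)
// actions attached to a transaction.
public sealed class ActionLog
{
    private readonly List<Action> _onCommit = new List<Action>();
    private readonly List<Action> _onRollback = new List<Action>();

    public void OnCommit(Action action) => _onCommit.Add(action);
    public void OnRollback(Action action) => _onRollback.Add(action);

    // Deferred actions run in registration order once the outcome is known.
    public void Commit()
    {
        foreach (Action action in _onCommit) action();
        Clear();
    }

    // Compensations run newest-first, undoing work in reverse order.
    public void Rollback()
    {
        for (int i = _onRollback.Count - 1; i >= 0; i--) _onRollback[i]();
        Clear();
    }

    private void Clear()
    {
        _onCommit.Clear();
        _onRollback.Clear();
    }
}
```

Under this model a transacted console write would be registered as `log.OnCommit(() => Console.WriteLine("hello"))`, so nothing reaches the console unless the transaction commits.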
But in even thinking this thought, we realized we were standing on shaky ground. What if the WriteLine was followed by something like a ReadLine, for example, where the program was meant to wait for the user to enter something into the console (likely in response to the prompt output by WriteLine)? (This example is a toy, of course, but represents a more fundamental pattern common in networked programs.) The basic problem was immediately clear. Adding isolation to an existing non-isolated operation is not always behavior-preserving, particularly when I/O is involved. Sometimes it is necessary to step outside of the isolation that would otherwise get poured on top by a simple transactional model.
This particular problem isn’t specific to traditional I/O per se.
Foreign function interface calls through .NET’s P/Invoke suffer from like problems. A call to CreateEvent may be compensatable (via an on-rollback action) with a call to CloseHandle. But this is flawed. Once that event’s HANDLE is requested, and/or it is passed to other Win32 APIs like MsgWaitForMultipleObjects, then the isolation of the faux transaction is broken, and real state must be provided to the Win32 APIs. And if another thread were to look up that HANDLE – perhaps through a name given to it in the call to CreateEvent – it may be able to see and interact with that event before the enclosing transaction has been committed. The abstraction leaks. And even if the abstraction is perfect, it is obvious there’s quite a bit of work to be had in order to transact all the touch points between .NET and Win32, of which there are many. And I mean many.
Other issues wait just around the corner. For example, how would you treat a lock block that was called from within a transaction? (You might say “that’s just not supported”, but when adding TM to an existing, large ecosystem of software, you’ve got to draw the compatibility line somewhere. If you draw it too carelessly, large swaths of existing software will not work; and in this case, that often meant that we claimed to provide unbounded transactions, and yet we would be placing bounds on them such that a lot of existing software could not be composed with transactions. Not good.) A seemingly straightforward answer is to treat a lock block like an atomic block. So if you encounter:
atomic {
lock (obj) { … }
}
it is logically transformed into:
atomic {
atomic { … }
}
On the face of it, this looks okay. (Forget problems like freeform use of Monitor.Enter/Exit for now.) We’re strengthening the atomicity and isolation, so what could go wrong? Well, it turns out that examples like this can also suffer from the “too much isolation” problem. Adding transactions to a lock-block extends the lifetime of the isolation of that particular block’s effects, possibly leading to lack of forward progress. In fact, you don’t need locks to illustrate the problem. Imagine a simple lock-free algorithm that communicates between threads using shared variables:
volatile int flag = 0;
…
// Thread A
flag = 1;
while (flag == 1) ;

// Thread B
while (flag != 1) ;
flag = 2;
If you invoke this code from within a transaction (on each thread), you’re apt to deadlock. Each transaction’s effects will be isolated from the other’s, whereas we are quite obviously intending to publish the updates to the flag variable immediately.
Anyway, the whole lock thing is a bit of a digression. The simple fact is that very little .NET code would actually run inside an atomic block but for things like collections and pure computations due to the I/O problem. You can develop one-off solutions for each problem that arises – and indeed we did so for many of them – and even hang those solutions underneath one general framework – like System.Transactions – but you cannot help but eventually become overwhelmed by the totality of the situation. The team experimented with static checking to turn these dynamic failures into static ones, but this only marginally improved matters.
I could go on and on about the I/O problem, its various incarnations, and what we did about it. Instead I will sum it up: this problem was, and still is, the “elephant in the room” threatening unbounded TM’s broader adoption.
The question ultimately boils down to this: is the world going to be transactional, or is it not?
Whether unbounded transactions foisted upon the world will succeed, I think, depends intrinsically on the answer to this question. It sure looked like the answer was going to be “Yes” back when transactional NTFS and registry were added to Vista. But the momentum appears to have slowed dramatically.
Nesting
Let’s get back to some fun, less depressing material. There are more surprises lurking ahead.
I already mentioned a great virtue of transactions is their ability to nest. But I neglected to say how this works. And in fact when we began, we only recognized one form of nesting. You’re in one atomic block and then enter into another one. What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known? Intuition guided us to the following answer:
- If the inner transaction rolls back, the outer transaction does not necessarily do so. However, no matter what the outer transaction does, the effects of the inner will not be seen.
- If the inner transaction commits, the effects remain isolated in the outer transaction. It “commits into” the outer transaction, we took to saying. Only if the outer transaction subsequently commits will the inner’s effects be visible; if it rolls back, they are not.
For example, consider this code:
void f() {
    atomic { // Tx0
        x++;
        try {
            g();
        } catch {
            if (p0)
                throw;
        }
        if (p2)
            throw new FooException();
    }
}

void g() {
    atomic { // Tx1
        y++;
        if (p1)
            throw new BarException();
    }
}
Imagine x = y = 0 at the start, and we invoke f. Many outcomes are possible.
- If p1 is true, g will throw an exception, aborting Tx1’s write to y. There are then two possibilities. (1) If p0 is true, the exception is repropagated and Tx0 will also abort, rolling back its write to x; this leaves x == y == 0. (2) If p0 is false, the exception is swallowed, and (assuming p2 is false) Tx0 proceeds to committing its write to x; this leaves x == 1, whereas y == 0.
- If p1 is false, on the other hand, g will not throw anything. Tx1 will commit its write to y “into” the outer transaction Tx0. One of two outcomes will now occur depending on the value of p2. (1) If p2 is true, an exception is thrown out of f, and Tx0 rolls back both the inner transaction Tx1’s effects and its own, leaving x == y == 0. (2) Else, f completes ordinarily, and Tx0 commits both Tx1’s and its own effects, leading to x == y == 1.
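The closed-nesting rules can be modeled in a small, single-threaded sketch. Note the swap: this model uses buffered write sets rather than the in-place writes described earlier, purely because that keeps the example short. `Tx`, `NestingDemo`, and the string-keyed store are hypothetical, and `InvalidOperationException` stands in for both exception types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of closed nesting: an inner transaction commits into its
// parent's write set, and only the outermost commit reaches the store.
public sealed class Tx
{
    private readonly Tx _parent;
    private readonly Dictionary<string, int> _store;
    private readonly Dictionary<string, int> _writes = new Dictionary<string, int>();

    public Tx(Dictionary<string, int> store) { _store = store; }
    private Tx(Tx parent) { _parent = parent; _store = parent._store; }

    public Tx BeginNested() => new Tx(this);

    // Reads see this transaction's writes, then its ancestors', then the store.
    public int Read(string name)
    {
        for (Tx t = this; t != null; t = t._parent)
            if (t._writes.TryGetValue(name, out int value)) return value;
        return _store[name];
    }

    public void Write(string name, int value) => _writes[name] = value;

    public void Commit()
    {
        Dictionary<string, int> target = _parent != null ? _parent._writes : _store;
        foreach (KeyValuePair<string, int> kv in _writes) target[kv.Key] = kv.Value;
        _writes.Clear();
    }

    public void Rollback() => _writes.Clear();
}

public static class NestingDemo
{
    // Mirrors f() and g() from the text and returns the final (x, y).
    public static (int X, int Y) Run(bool p0, bool p1, bool p2)
    {
        var store = new Dictionary<string, int> { ["x"] = 0, ["y"] = 0 };
        var tx0 = new Tx(store);
        try
        {
            tx0.Write("x", tx0.Read("x") + 1);
            try { G(tx0, p1); }
            catch (InvalidOperationException) { if (p0) throw; }
            if (p2) throw new InvalidOperationException("Foo");
            tx0.Commit();
        }
        catch (InvalidOperationException)
        {
            tx0.Rollback();
        }
        return (store["x"], store["y"]);
    }

    private static void G(Tx outer, bool p1)
    {
        Tx tx1 = outer.BeginNested();
        try
        {
            tx1.Write("y", tx1.Read("y") + 1);
            if (p1) throw new InvalidOperationException("Bar");
            tx1.Commit(); // commits "into" tx0, not into the store
        }
        catch
        {
            tx1.Rollback();
            throw;
        }
    }
}
```

For instance, `NestingDemo.Run(false, false, false)` leaves both x and y at 1, while `NestingDemo.Run(false, false, true)` leaves both at 0 because the outer rollback discards the inner transaction's already "committed" write.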
We expected most peoples’ intuition to match this behavior.
The canonical working example was a BankAccount class:
class BankAccount {
decimal m_balance;
public void Deposit(decimal delta) {
atomic { m_balance += delta; }
}
public static void Transfer(
BankAccount a, BankAccount b, decimal delta) {
atomic {
a.Deposit(-delta);
b.Deposit(delta);
}
}
}
This was an illustrative and beautiful example. It made beautiful slide-ware. We are composing the Deposit operations of two separate bank accounts into a single Transfer method. Of course the a.Deposit(-delta) and b.Deposit(delta) calls must be made atomic together, else a failure could lead to missing money, or someone could witness the world with the money in transit (and nowhere except on one thread’s stack) rather than having been transferred atomically. And building the same thing with locks is frustratingly difficult: using fine-grained per-account locks can lead to deadlock very quickly.
Intuitively we walked down many variants of this mode of nesting. We reacquainted ourselves with Moss’s great dissertation on the topic, and remembered this intuitive nesting mode as closed nested transactions. And we shortly recognized the need for another mode: open nested transactions.
To motivate the need for open nesting, imagine we’ve got a hashtable whose physical storage is independent from its logical storage. Resizing the table of buckets, for example, has little to do with whether a particular {key, value} pair exists within those buckets. The resizing operation, in fact, is logically idempotent and isolated: the same set of keys will exist within the table both before and after such an operation. So we can actually commit the physical effects of such an operation eagerly. With a naïve TM implementation, two independent keys hashing to the same bucket will conflict, and the reads and writes for such operations will live as long as the enclosing user-level transactions. Instead, we can serialize logical operations with respect to one another at a “higher level” than physically independent operations do, leading to greater concurrency. Two transactions will only conflict in long-running transactions if they truly operate on the same keys, rather than just happening to hash to the same bucket.
Open nesting forced us to contemplate the sharing of state between outer and inner transactions more deliberately, and gave us some troubles syntactically. We had wanted to say:
atomic { // ordinary closed nesting.
Foo f = new Foo();
atomic(open) { // open nesting.
… f? …
}
}
But is it really legal for the inner transaction here to access the ‘f’, which has been constructed and is presumably uncommitted in the outer transaction? With closed nested transactions there is lock compatibility between the outer and inner transactions. An inner closed nested transaction can of course read a memory location write-locked by the outer transaction, for example. However, the same must not be true of open nesting, because an open nested transaction commits “to the world” rather than into its outer transaction. Allowing it to read and then potentially publish uncommitted state would violate serializability. It’s possible that the inner open nested transaction will commit, whereas the outer will roll back. (The reverse situation is equally problematic.) And yet it’s darn useful to pass state from an outer to an inner transaction – and indeed, often impossible to do anything otherwise. In the hashtable example, what if the key itself were a complicated object graph rather than a value, and the key bleeds across transaction boundaries?
Many issues like this arose. Our straightforward answer was that only pass-by-value worked across such a boundary. I don’t think we ever found nirvana here.
We developed other transaction modes also.
As we added data parallel operations within a nested transaction, we realized that we’d need something a lot like closed nesting but with special accommodation for intra-transaction parallelism. This led us to parallel nested transactions, enabling lock sharing from a parent to its many data parallel children. These children could not communicate with one another other than to “commit into” the parent, and subsequently reforking, thereby ensuring non-interference between them. Of course children could share read-locks amongst one another, just not write locks.
And we continued to reject the temptation of adding weakened serializability modes a la relational databases (unrepeatable reads, etc). Although we expected this to arise out of necessity with time, it never did; the various nesting modes we provided seemed to satisfy the typical needs.
A Better Condition Variable
Here’s a brief aside on one of TM’s bonus features.
Some TM variants also provide for “condition variable”-like facilities for coordination among threads. I think Haskell was the first such TM to provide a ‘retry’ and ‘orElse’ capability. When a ‘retry’ is encountered, the current transaction is rolled back, and restarted once the condition being sought becomes true. How does the TM subsystem know when that might be? This is an implementation detail, but one obvious choice is to monitor the reads that occurred leading up to the ‘retry’ – those involved in the evaluation of the predicate – and once any of them changes, to reschedule that transaction to run. Of course, it will reevaluate the predicate and, if it has become false, the transaction will ‘retry’ again.
A simple blocking queue could be written this way. For example:
object TakeOne()
{
atomic {
if (Count == 0)
retry;
return Pop();
}
}
If, upon entering the atomic block, Count is witnessed as being zero, we issue a retry. The transaction subsystem notices we read Count with a particular version number, and then blocks the current transaction until Count’s associated version number changes. The transaction is then rescheduled, and races to read Count once again. After Count is seen as non-zero, the Pop is attempted. The Pop, of course, may fail because of a race – i.e. we read Count optimistically without blocking out writers – but the usual transaction automatic-reattempt logic will kick in to mask the race in that case.
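For comparison, here is roughly what the same blocking take looks like when written by hand with a monitor and a condition wait, which is the bookkeeping that ‘retry’ does for you. This is a conventional sketch, not TM code; `MonitorBlockingQueue` is a hypothetical name, and it uses FIFO `Queue<T>` storage rather than the Pop shown above.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical lock-based counterpart of the TakeOne example.
public sealed class MonitorBlockingQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly object _gate = new object();

    public void Add(T item)
    {
        lock (_gate)
        {
            _items.Enqueue(item);
            // The producer must remember to wake waiters; with 'retry' the TM
            // would notice the change to Count on its own.
            Monitor.PulseAll(_gate);
        }
    }

    public T TakeOne()
    {
        lock (_gate)
        {
            // Hand-written equivalent of "if (Count == 0) retry;": re-check
            // the predicate every time we are woken.
            while (_items.Count == 0)
                Monitor.Wait(_gate);
            return _items.Dequeue();
        }
    }
}
```

The difference in burden is the point: here the consumer and every producer must agree on one lock object and on who signals whom, whereas the transactional version only states the predicate.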
The ‘orElse’ feature is a bit less obvious, though still rather useful. It enables choice among multiple transactions, each of which may end up issuing a ‘retry’. I don’t think I’ve seen it in any TMs except for ours and Haskell’s.
To illustrate, imagine we’ve got 3 blocking queues like the one above. Now imagine we’d like to take from the first of those three that becomes non-empty. ‘orElse’ makes this simple:
BlockingQueue bq1 = …, bq2 = …, bq3 = …;
atomic {
object obj =
orElse {
bq1.TakeOne(),
bq2.TakeOne(),
bq3.TakeOne()
};
}
While ‘orElse’ is perhaps an optional feature, you simply can’t write certain kinds of multithreaded algorithms without ‘retry’. Anything that requires cross-thread communication would need to use spin variables.
Deliberate Plans of Action: Policy
I waved my hands a bit above perhaps without you even knowing it. When I talk about optimistic, pessimistic, and automatic retry, I am baking in a whole lot of policy. It turns out there is a wide array of techniques. The simplest question we faced early on was, when an optimistic read fails to validate at the end of a transaction, when should we reattempt execution of that transaction?
The naïve answer is “immediately”. But obviously that would lead to livelock under some conditions. A more reasonable answer is “spin for N cycles and then retry”. But this too can lead to livelock. A better answer is to either choose some random strategy, or to make an intelligent adaptive choice. We experimented with many such variants, including random backoff, sophisticated waiting and signaling based on the memory locations in question, among others. We even played games like giving transactions karma points for cooperatively acquiescing to other competing transactions, and allowing those transactions with the most karma points to make more forward progress before interrupting them.
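One of the simpler policies mentioned above, random backoff, might look like the following. This is a generic sketch of randomized exponential backoff, not the policy the team shipped; `Backoff`, the spin constants, and the `tryOnce` delegate standing in for "run the transaction and report whether it committed" are all assumptions.

```csharp
using System;
using System.Threading;

// Hypothetical retry policy: randomized exponential backoff between attempts.
public static class Backoff
{
    // Upper bound on spin iterations for a given attempt: doubles each time,
    // capped at maxSpins.
    public static int Ceiling(int attempt, int baseSpins, int maxSpins)
    {
        long ceiling = (long)baseSpins << Math.Min(attempt, 20);
        return (int)Math.Min(ceiling, maxSpins);
    }

    // Re-runs tryOnce until it reports success and returns the number of
    // attempts used. Between failures it spins for a random duration so that
    // competing transactions are unlikely to collide again in lockstep.
    public static int RunWithRetry(Func<bool> tryOnce, int maxAttempts, Random rng)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            if (tryOnce()) return attempt + 1;
            int spins = rng.Next(1, Ceiling(attempt, 16, 4096) + 1);
            Thread.SpinWait(spins);
        }
        throw new TimeoutException("Transaction did not commit within the attempt budget.");
    }
}
```

The randomness is what separates this from "spin for N cycles and then retry": two transactions that conflicted once will usually wake at different times.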
A few good papers supplied useful (and entertaining) reading material on the topic, but to be honest, nobody had a good answer at the time. Thankfully these are all implementation details. So we were free to experiment.
Deadlock breaking also requires policy. Thankfully we can actually roll back the effects of transactions engaged in a deadly embrace with TM, so we merely need to know how often to run the deadlock detection algorithm. There was a similar problem when deciding to back off outer layers of nesting, and in fact this becomes more complicated when deadlocks are involved. Imagine:
// Thread A
atomic {
    x++;
    atomic {
        y++;
    }
}

// Thread B
atomic {
    y++;
    atomic {
        x++;
    }
}
This deadlock-prone example is tricky because rolling back the inner-most transactions won’t be sufficient to break the deadlock that may occur. Instead the TM policy manager needs to detect that multiple levels of nesting are involved and must be blown away in order to unstick forward progress.
Another variant that went beyond deciding when to favor one transaction over another was to upgrade to pessimistic locking if optimistic let us down. The whole justification behind optimistic is that, …well, we’re optimistic that conflicts won’t happen. So it seems reasonable that, if they do occur, we fall back to something more, …well, pessimistic. There is a dial here too. Perhaps you only want to fall back to pessimistic after failing optimistically N times in a row, where N > 1. As I mentioned above, our single-word lock associated with each object supported both locking and versioning cheaply.
Disillusionment Part II: Weak or Strong Atomicity?
All along, we had this problem nipping at our heels. What happens if code accesses the same memory locations from inside and outside a transaction? We certainly expected this to happen over the life of a program: state surely transitions from public and shared among threads to private to a single thread regularly. But if some location were to be accessed transactionally and non-transactionally at the same time, we’d (presumably) have a real mess on our hands. A supposedly atomic, isolated, etc. transaction would no longer be protected from the evils of racy code.
For example:
// Thread A
atomic { // Tx0
    x++;
}

// Thread B
x++; // No-Tx
Can we make any statements about the value of x after Tx0 commits (or rolls back)? Not really. It depends on the way the particular TM being used has been implemented. An in-place model that rolls back could not only roll back Tx0’s but also the unprotected x++’s write. And so on.
On one hand, this code is racy. So you could explain away the undefined behavior as being a race condition. On the other hand, it was also troublesome. All those problems with locks begin cropping up all over the place. It would have been ideal if we could notify developers that they made a mistake. Then we could have made the assertion that data races are simply not possible with TM.
(Except for consistency-related ones, of course.)
At the same time, many hardware models were being explored. And of course in hardware you’ve got the physical addresses that variables resolve to and needn’t worry about aliasing. So it was actually possible to issue a fault if a location was used transactionally and non-transactionally at once. But given that our solution was software-based, we were uncomfortable betting the farm on hardware support.
Another approach was static analysis. We could require transactional locations to be tagged, for example. This had the unfortunate consequence of making reusable data structures less, well, reusable. Collections for example presumably need to be usable from within and outside transactions alike. After-the-fact analysis could be applied without tagging, but false positives were common. We never really took a hard stance on this problem, but always assumed the combination of static analysis, tooling, and, perhaps someday, hardware detection would make this problem more diagnosable. But I think we generally resigned ourselves to the fact that our TM would suffer from weak atomicity problems.
We thought this was explainable. Sadly it led to something that surely was not.
Disillusionment Part III: the Privatization Problem
I still remember the day like it was yesterday. A regular weekly team meeting, to discuss our project’s status, future, hard problems, and the like. A summer intern on board from a university doing pioneering work in TM, sipping his coffee. Me, sipping my tea. Then that same intern’s casual statement pointing out an Earth-shattering flaw that would threaten the kind of TM we (and most of the industry at the time) were building. We had been staring at the problem for over a year without having seen it. It is these kinds of moments that frighten me and make me a believer in formal computer science.
Here it is in a nutshell:
bool itIsOwned = false;
MyObj x = new MyObj();
…
// Thread A
atomic { // Tx0
    // Claim the state for my use:
    itIsOwned = true;
}
int z = x.field;
...

// Thread B
atomic { // Tx1
    if (!itIsOwned)
        x.field += 42;
}
The Tx0 transaction changes itIsOwned to true, and then commits. After it has committed, it proceeds to using whatever state was claimed (in this case an object referred to by variable x) outside of the purview of TM. Meanwhile, another transaction Tx1 has optimistically read itIsOwned as false, and has gone ahead to use x. An update in-place system will allow that transaction to freely change the state of x. Of course, it will roll back here, because itIsOwned changed to true. But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there. A known flaw in any weakly atomic, update in-place TM.
If this example appears contrived, it’s not. It shows up in many circumstances. The first one in which we noticed it was when one transaction removes a node from a linked list, while another transaction is traversing that same list. If the former thread believes it “owns” the removed element simply because it took it out of the list, someone’s going to be disappointed when its state continues to change.
This, we realized, is just part and parcel of an optimistic TM system that does in-place writes. I don’t know that we ever fully recovered from this blow. It was a tough pill to swallow. After that meeting, everything changed: a somber mood was present and I think we all needed a drink. Nevertheless we plowed forward.
We explored a number of alternatives. And so did the industry at large, because that intern in question published a paper on the problem. One obvious solution is to have a transaction that commits a change to a particular location wait until all transactions that have possibly read that location have completed – a technique we called quiescence. We experimented with this approach, but it was extraordinarily complicated, for obvious reasons.
We experimented with blending in pessimistic operations in place of optimistic ones, and with alternative commit protocols such as a “commit ticket” approach that serializes transaction commits. Each of these tended to sacrifice performance greatly. Eventually the team decided to do buffered writes instead of in-place writes, because concurrent modifications made inside a transaction then never touch the actual memory being used outside of the transaction unless that transaction successfully commits.
This, however, led to still other problems, like the granular loss of atomicity problem. Depending on the granularity of your buffered writes – we chose object-level – you can end up with false sharing of memory locations between transactional and non-transactional code. Imagine you update two separate fields of an object from within and outside a transaction, respectively, concurrently. Is this legal? Perhaps not. The transaction may bundle state updates to the whole object, rather than just one field.
All these snags led to the realization that we direly needed a memory model for TM.
Disillusionment Part IV: Where is the Killer App?
Throughout all of this, we searched and searched for the killer TM app. It’s unfair to pin this on TM, because the industry as a whole still searches for a killer concurrency app. But as we uncovered more successes in the latter, I became less and less convinced that the killer concurrency apps we will see broadly deployed in the next 5 years needed TM. Most enjoyed natural isolation, like embarrassingly parallel image processing apps. If you had sharing, you were doing something wrong.
In Conclusion
I eventually shifted focus to enforcing coarse-grained isolation through message-passing, and fine-grained isolation through type system support a la Haskell’s state monad. This would help programmers to realize where they accidentally had sharing, I thought, rather than merely masking this sharing and making it all work (albeit inefficiently).
I took this path not because I thought TM had no place in the concurrency ecosystem. But rather because I believed it did have a place, but that several steps would be needed before getting there.
I suspected that, just like with Argus, you’d want transactions around the boundaries. And that you’d probably want something like open nesting for fine-grained scalable data structures, like shared caches. These are often choke points in a coarse-grained locking system, and often cannot be fully isolated, at least in the small. Ironically I am just now arriving there. In the system I work on I see these issues actually staring us in the face.
This is just my own personal view on TM. You may also be interested in reading the current STM.NET team’s views, available on their MSDN blog.
For me the TM project was particularly enjoyable. And it was a great learning experience. I worked with some amazing people, and it was a privilege. You really had the sense that something big was right around the corner, and every day was a rush of enjoyment. Despite running as fast as we could, it seemed like we could just barely keep pace with the research community. Over time more and more researchers turned to TM, and I distinctly recall reading at least one new TM paper per week.
This was also the first time I realized that Microsoft, at its core, really does operate like a collection of many startups. Our TM work was a grassroots movement, and there was no official sponsorship for our effort at the start. It was just a group of people independently getting together to discuss how TM might fit into the direction the industry was headed. Eventually TM started showing up on slide decks in presentations to management, followed by dedicated TM reviews, and even a BillG review. I will never forget, a couple years after that review – during an overall concurrency review – Bill standing up at the whiteboard, drawing the code “atomic { … }” and asking something to the effect: “Why can’t you just use transactional memory for that?” I guess the idea stuck with him too.
Who knows. Maybe in 10 years, the world will be transactional after all.
 Sunday, November 01, 2009
Say you've got a Task<T>. Well, now what?
You know that eventually a T will become available, but until then you're out of luck. You could go ahead and be a naughty little devil by calling Wait on it -- blocking the current thread (eek!) -- or you could call ContinueWith on the task to get back a new Task<U>, representing the work you would do to create some new U object if only you presently had a T in hand. And then perhaps you will find yourself in the same situation for that U.
These are those dataflow graphs I mentioned in the previous blog post. Things of beauty.
To be more concrete about the situation I describe, imagine you've got the following IFoo interface:
interface IFoo
{
int Bar();
string Baz(int x);
}
Now, given a Task<IFoo>, you can't do anything related to an IFoo. And yet presumably that's why you've got the task in the first place: because you care about the IFoo. What if you ultimately want to invoke the Bar method, for example?
Task<IFoo> task = ...;
You can of course block the thread:
// Option A: block the thread.
int resultA = task.Result.Bar();
...
Or you can choose to program in a very clunky way:
// Option B: use dataflow.
Task<int> resultB = task.ContinueWith(t => t.Result.Bar());
But what if, instead, you could do something like this?
// Option C: magic.
Task<int> resultC = task.Bar();
Whoa, wait a minute. We're calling Bar() on a Task<IFoo>? Neat, but how can that be?
This is obviously a trick. All of the members of T are somehow being made available on the Task<T> object, so that they can be called before the task has actually been resolved to a concrete value. Of course, were we to allow this, what you get back to represent the result of such calls would need to be task objects too: hence we get back a Task<int> from the call on Bar(), instead of an int. This is similar to call streams in Barbara Liskov's Argus language (her primary focus immediately after CLU).
This kind of lifting from the inner type outward is much like what you get in languages that allow generic mixins. C# already has one semi-such type, though you may not realize it: Nullable<T> actually allows you to directly access interfaces implemented by T without needing to call Value on it. It's almost like Nullable<T> was defined as deriving from T itself, which is clearly not actually possible (for numerous reasons, not the least significant of which is that it's a struct). Try it. This works because the type system treats Nullable<T> and T somewhat uniformly (though you'd be surprised by some dangers lurking within -- effectively Nullable<T> mustn't implement any interfaces *ever*, otherwise a type hole would result). But I digress...
Unfortunately, without deep language changes we can't get this to work the way we'd like. I have found numerous occasions where a general lifting capability in C# would be useful: Lazy<T> is but one example. That said, each time we run across an instance, it demands slightly different type system treatment, and it seems unlikely such a general facility would be as usable as the one-off features.
Type systems aside, I am actually using a very dirty trick to make this work: I'm using the new System.Dynamic features in .NET 4.0 to do it all dynamically. You may love or hate this, depending on your stance on type systems. Being an ML guy, I'll let you figure out what I think. (Hint: gross hack!)
We can go further. (Although sadly I won't demonstrate how to do so in this blog post. I had wanted to go all the way, but need to get some actual language work done today, in addition to a little Riemann study, instead of having endless fun tinkering with Visual Studio 2010. Shucks.) Notice that Baz accepts an int as input. Well, what if all we've got is a Task<int>? We can of course also allow that to get passed in too:
Task<string> resultD = task.Baz(42); // Real input. Fine.
Task<int> arg = ...;
Task<string> resultE = task.Baz(arg); // A task as input! Cool!
But wait, there is more! It slices and dices too. The next trick is difficult -- if not impossible -- to do without far reaching language changes. But we could also even bridge the world of ordinary methods too, not just those that have been accessed by tunneling through a Task<T>. For example:
string f(int x) {...}
...
Task<int> task = ...;
Task<string> result = f(task);
Not to even mention:
Task<int> x = ...;
Task<int> y = ...;
Task<int> z = x + y;
This is deep. What we are saying is that anywhere a T is expected, we can supply a Task<T>. Of course once we've entered the world of tasks, we cannot escape until values actually begin resolving. So when we invoke the method f in this example, we of course get back a Task<string> for its result. Once we've stepped onto a turtle's back, well, it's turtles all the way down.
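Absent language support, the closest hand-written approximation is a pair of helper methods. What follows is only a sketch, not the implementation developed later in this post: the names LiftSketch, Lift, and Add are hypothetical, and it relies on nothing beyond ContinueWith and TaskFactory.ContinueWhenAll from .NET 4.0.

```csharp
using System;
using System.Threading.Tasks;

static class LiftSketch
{
    // Hypothetical helper: turns a call f(x) on a not-yet-available x into a Task<U>.
    public static Task<U> Lift<T, U>(Func<T, U> f, Task<T> arg)
    {
        return arg.ContinueWith(v => f(v.Result));
    }

    // Hypothetical helper: the moral equivalent of 'x + y' on two unresolved ints.
    public static Task<int> Add(Task<int> x, Task<int> y)
    {
        return Task.Factory.ContinueWhenAll<int>(
            new Task[] { x, y },
            _ => x.Result + y.Result);
    }

    static void Main()
    {
        var source = new TaskCompletionSource<int>();
        Task<string> result = Lift<int, string>(n => "value: " + n, source.Task);
        Task<int> sum = Add(source.Task, source.Task);

        source.SetResult(21); // Only now can the graph start resolving.
        Console.WriteLine(result.Result);
        Console.WriteLine(sum.Result);
    }
}
```

Note that the caller has to pick Lift or Add explicitly at every call site, which is exactly the ceremony a language-level feature would remove.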
(Which reminds me of the well known tale:
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: "What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise." The scientist gave a superior smile before replying, "What is the tortoise standing on?" "You're very clever, young man, very clever", said the old lady. "But it's turtles all the way down!"
Tasks are not greasy hamburgers after all, as I had claimed in the last post, but rather they are turtles.
I've wasted all of my energy speaking of turtle hamburgers drenched in asynchronous aioli, and have left only a little to go over the hacked up implementation of this idea. Sigh. Well, we had better get to it.)
In summary: we'll just rely on dynamic dispatch to do the lifting, thanks to the new .NET 4.0 DynamicObject class. This is wildly less efficient than a proper type system design would yield, not to mention the utter lack of static type checking. Of course a proper implementation that was designed for this from Day One would also avoid the tremendous amount of object allocation that relying on the current Task<T> objects and ContinueWith overloads implies. But nevertheless, this approach will allow us to at least have a good ole' time and stimulate the creative side of the noggin.
First, I shall provide an extension method for getting a DynamicTask<T> -- the thing that actually derives from DynamicObject and implements the custom dynamic binding:
public static class DynamicTask
{
public static dynamic AsDynamic<T>(this Task<T> task)
{
return new DynamicTask<T>(task);
}
}
Notice that this changes our calling conventions ever so slightly. Namely:
// Option C: magic.
Task<int> resultC = task.AsDynamic().Bar();
The AsDynamic call places the caller into the lifted context. As invocations are made, the results become real tasks, not dynamic ones, so continuing a chain of calls requires another AsDynamic() at each step. This is a minor inconvenience, and we could certainly wrap the return values in DynamicTask<T> objects automatically if we wanted to make chaining less verbose.
Second, we must implement the DynamicTask<T> class. We will do a very simple translation. Given a member access expression 'x.m', where m is either a field or property of type U, we will morph this into the new expression 'x.Task.ContinueWith(v => v.Result.m)', which is of type Task<U>. Similarly, given a method invocation 'x.M(a1,...,aN)', whose return value is of type U, we will morph it into the new expression 'x.Task.ContinueWith(v => v.Result.M(a1,...,aN))', which is of type Task<U> (or just Task if U is the void type). Supporting a task argument where an actual value is expected would require packing the argument together with the target into an array, and doing a ContinueWhenAll on it.
(Perhaps I will illustrate how to do these other tricks in a later post, but I'm tight for time right now. I'm only sketching the general idea. Even in what I show below, things will be incomplete, because topics such as getting exception propagation right when tasks begin failing are tricky. Ideally the whole dataflow chain will be "broken" by such an exception. Additionally, I've only implemented what was necessary to get a few interesting examples working. The binder, for example, certainly has a few loose ends. Blog reader beware.)
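Written out by hand against the IFoo interface from earlier, the translation looks roughly like the sketch below. The Foo class and the TranslationSketch helpers are hypothetical and exist only so the example can run. The task-argument case, which the binder in this post does not implement, is shown with TaskFactory.ContinueWhenAll on the assumption that waiting for both the target and the argument is all that is needed.

```csharp
using System;
using System.Threading.Tasks;

interface IFoo
{
    int Bar();
    string Baz(int x);
}

// Hypothetical implementation, only here so the sketch can run.
class Foo : IFoo
{
    public int Bar() { return 7; }
    public string Baz(int x) { return "baz:" + x; }
}

static class TranslationSketch
{
    // 'x.Bar()' becomes 'x.Task.ContinueWith(v => v.Result.Bar())'.
    public static Task<int> LiftedBar(Task<IFoo> target)
    {
        return target.ContinueWith(v => v.Result.Bar());
    }

    // 'x.Baz(arg)' with a task argument: wait for both the target and the argument.
    public static Task<string> LiftedBaz(Task<IFoo> target, Task<int> arg)
    {
        return Task.Factory.ContinueWhenAll<string>(
            new Task[] { target, arg },
            _ => target.Result.Baz(arg.Result));
    }

    static void Main()
    {
        var foo = new TaskCompletionSource<IFoo>();
        Task<int> bar = LiftedBar(foo.Task);
        Task<string> baz = LiftedBaz(foo.Task, bar); // Feed one lifted result into another.

        foo.SetResult(new Foo());
        Console.WriteLine(bar.Result);
        Console.WriteLine(baz.Result);
    }
}
```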
Here is the implementation of DynamicTask<T>:
public class DynamicTask<T> : DynamicObject
{
private Task<T> m_task;
public DynamicTask(Task<T> task)
{
if (task == null) {
throw new ArgumentNullException("task");
}
m_task = task;
}
public Task<T> Task {
get { return m_task; }
}
public override DynamicMetaObject GetMetaObject(Expression parameter) {
if (parameter == null) {
throw new ArgumentNullException("parameter");
}
return new TaskLiftedObject(this, parameter);
}
class TaskLiftedObject : DynamicMetaObject
{
...
}
}
Simple. All of the dynamic magic resides in the implementation of TaskLiftedObject, which derives from the DynamicMetaObject class. It is constructed with an instance of the DynamicTask<T> along with the expression tree that can be used to dynamically load up an instance of that task. All of the dynamic features work with expression trees. For example, in response to an attempt to invoke a method M on a DynamicTask<T>, our binder will need to find the right method M on the underlying T, and then return an expression tree that does the ContinueWith and so forth.
Let's start cracking open TaskLiftedObject:
class TaskLiftedObject : DynamicMetaObject
{
private DynamicTask<T> m_task;
public TaskLiftedObject(DynamicTask<T> task, Expression expression) :
base(expression, BindingRestrictions.Empty, task)
{
m_task = task;
}
We will override two of DynamicMetaObject's functions. BindGetMember is called when a member is accessed (like a property or field), whereas BindInvokeMember is called when a method call is made. There are several other methods that a proper binder would need to override in order to make delegate dispatch and such work properly. But this suffices to get started:
public override DynamicMetaObject BindGetMember(GetMemberBinder binder)
{
// We have a member access:
// x.m
//
// which must become:
// x.Task.ContinueWith(v => { v.Result.m; })
//
return new DynamicMetaObject(
MakeContinuationTask(Bind(binder.Name, -1), null),
BindingRestrictions.GetInstanceRestriction(Expression, Value),
Value
);
}
public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
{
// We have a call:
// x.Foo(a1,...,aN)
//
// which must become:
// x.Task.ContinueWith(v => { v.Result.Foo(a1,...,aN); })
//
Expression[] argsEx = new Expression[args.Length];
for (int i = 0; i < args.Length; i++) {
argsEx[i] = args[i].Expression;
}
return new DynamicMetaObject(
MakeContinuationTask(Bind(binder.Name, binder.CallInfo.ArgumentCount), argsEx),
BindingRestrictions.GetInstanceRestriction(Expression, Value),
Value
);
}
Clearly the workhorses here are Bind and MakeContinuationTask. Bind is responsible for performing dynamic lookup for a matching member on T that has the requested Name and, if a method call is being made, the proper number of parameters. For brevity, I've omitted anything to do with argument type checking, an obvious hole that we'd want to fix some day:
private static MemberInfo Bind(string name, int argCount)
{
// Lookup the target member on the T, rather than the (Dynamic)Task<T>.
return
(from m in typeof(T).GetMembers(BindingFlags.Instance | BindingFlags.Public)
where m.Name.Equals(name) &&
(argCount == -1 ?
!(m is MethodInfo) :
m is MethodInfo && ((MethodInfo)m).GetParameters().Length == argCount) // Skip same-named non-methods.
select m).
Single();
}
Nothing too interesting here either -- just a bit of hacky reflection code done with a fancy LINQ query. If anything other than exactly one matching member was found, the call to Single() will throw an exception. If you want to see what a "real" dynamic binder looks like, you won't find it here: check out VB's or IronPython's.
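To see how this lookup behaves in isolation, here is a stand-alone sketch that runs the same shape of query against a small interface. ISample and BindSketch are hypothetical names, and the sketch checks the member's type before casting so that a same-named property cannot trigger an InvalidCastException.

```csharp
using System;
using System.Linq;
using System.Reflection;

interface ISample
{
    int Count { get; }
    int Bar();
    string Baz(int x);
}

static class BindSketch
{
    // argCount == -1 means "field or property"; otherwise match a method by arity.
    public static MemberInfo Lookup(Type type, string name, int argCount)
    {
        return
            (from m in type.GetMembers(BindingFlags.Instance | BindingFlags.Public)
             where m.Name == name &&
                   (argCount == -1 ?
                       !(m is MethodInfo) :
                       m is MethodInfo && ((MethodInfo)m).GetParameters().Length == argCount)
             select m).
            Single();
    }

    static void Main()
    {
        Console.WriteLine(Lookup(typeof(ISample), "Baz", 1).MemberType);
        Console.WriteLine(Lookup(typeof(ISample), "Count", -1).MemberType);

        try {
            Lookup(typeof(ISample), "Bar", 2); // No two-argument Bar exists.
        }
        catch (InvalidOperationException) {
            Console.WriteLine("no unique match");
        }
    }
}
```

Because only the name and the argument count are compared, two overloads with the same arity would also make Single() throw, which is the argument type checking hole mentioned above.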
Now for the meat. The MakeContinuationTask method takes the target member that we've found dynamically via Bind, as well as an optional array of expression trees, each representing an argument being passed to the target method (and which will be null for property and field access), and manufactures the expression tree that represents the execution of the dynamic call itself:
private Expression MakeContinuationTask(MemberInfo target, Expression[] targetArgs)
{
var lambdaParam = Expression.Parameter(typeof(Task<T>), "v");
var lambdaParamResult = Expression.Property(lambdaParam, "Result");
Expression lambdaBody;
Type lambdaReturnType;
if (target is MethodInfo) {
lambdaBody = Expression.Call(lambdaParamResult, (MethodInfo)target, targetArgs);
lambdaReturnType = ((MethodInfo)target).ReturnParameter.ParameterType;
}
else if (target is PropertyInfo) {
lambdaBody = Expression.Property(lambdaParamResult, (PropertyInfo)target);
lambdaReturnType = ((PropertyInfo)target).PropertyType;
}
else if (target is FieldInfo) {
lambdaBody = Expression.Field(lambdaParamResult, (FieldInfo)target);
lambdaReturnType = ((FieldInfo)target).FieldType;
}
else {
throw new Exception("Unsupported dynamic invoke: " + target.GetType().Name);
}
return Expression.Call(
Expression.Property(
Expression.Convert(this.Expression, typeof(DynamicTask<T>)),
typeof(DynamicTask<T>).GetProperty("Task")
),
GetContinueWith(lambdaReturnType), // ContinueWith
new Expression[] {
// v => { v.Result.M(a0,...,aN) }
Expression.Lambda(lambdaBody, lambdaParam)
}
);
}
You should be able to convince yourself that this code generates the desired transformation described earlier. It uses a method to find the overload of Task<T>.ContinueWith that we want to bind against, and invokes that on the Task<T> contained within the DynamicTask<T> against which the dynamic call was made. It is rather unfortunate that the CLR does not allow the void type as a generic type argument, so we have to be a little bit inconsistent with our treatment of void returns, by choosing a different ContinueWith overload.
If the above reflection code was called hacky, the ContinueWith lookup is worse. It's very inefficient, not to mention fragile (because it depends on the current layout of Task<T>'s overloads, what with instantiating generic methods and the like). C'est la vie:
private static MethodInfo GetContinueWith(Type returnType)
{
// @TODO: caching to avoid expensive lookups each time.
if (returnType == typeof(void)) {
return typeof(Task<T>).GetMethod(
"ContinueWith",
new Type[] { typeof(Action<Task<T>>) }
);
}
else {
foreach (MethodInfo mif in typeof(Task<T>).GetMethods()) {
if (mif.Name == "ContinueWith" && mif.IsGenericMethodDefinition) {
MethodInfo mifOfT = mif.MakeGenericMethod(returnType);
ParameterInfo[] mifParams = mifOfT.GetParameters();
if (mifParams.Length == 1 &&
mifParams[0].ParameterType == typeof(Func<,>).MakeGenericType(typeof(Task<T>), returnType)) {
return mifOfT;
}
}
}
}
throw new Exception("Fatal error: ContinueWith overload not found");
}
}
And that's it. With that, we can get dynamic invocations on unresolved T's via Task<T> objects. Nifty.
I'm not saying any of this is a really good idea. Honestly, I'm not. Of course, there's a kernel of a good idea there and the systems we are working on take this kernel to its extreme. By providing a programming model that encourages deep chains of dataflow to be expressed speculatively in a natural and familiar manner, greater degrees of latent parallelism can lie resident in an application waiting to be unlocked as more processors become available. Doing it for real requires impactful changes to the language, supporting infrastructure, and particularly tooling. Just imagine what it means to break into a debugger to inspect deep dataflow graphs that have been constructed by compiler magic underneath you. And the use of ContinueWith is a little lame, because of course the target of our call may be something that can be run speculatively too with first class pipelining, rather than completely delaying the invocation of it.
So we won't be seeing lifted tasks in .NET anytime soon. Writing up this blog post was merely an excuse to toy around with the new C# dynamic features and to have a little recreational time. And to generate excitement about what .NET 4.0 holds in store. I hope you have enjoyed it. Now back to reality.
 Saturday, October 31, 2009
Well, Visual Studio 2010 Beta 2 is out on the street. It contains plenty of neat new things to keep one busy for at least a rainy Saturday. I proved this today.
Of course, Parallel Extensions is in the box. .NET 4.0's Task and Task<T> abstractions are used to implement such things as PLINQ and Parallel.For loops, but of course they are great for representing asynchronous work too. The FromAsync adapters move you from the dark ages of IAsyncResult to the glitzy new space age of tasks.
Not only are tasks tastier than hamburgers, but they enable complex dataflow graphs of asynchronous work to unfold dynamically at runtime, thanks to the ContinueWith method. From a Task<T> you can get a Task<U> that was computed based on the T; ad infinitum. We like dataflow. It is the key to unlocking parallelism, or more accurately, boiling away all else except for dataflow is the key. But what about control flow, you might ask? We like it less. But you can do it, so long as you put in some work. F#'s async workflows make this sort of thing a tad easier, but the raw libraries in .NET 4.0 don't come with any sort of loops or conditional capabilities. Perhaps in the future they will. Nevertheless, in this post I shall demonstrate how to build a couple simple ones.
Not because the lack of them is going to cause unprecedented and unheard of horrors, but rather because in doing so we'll see some neat features of tasks.
The two methods I will illustrate in this post are:
public static class TaskControlFlow
{
public static Task For(int from, int to, Func<int, Task> body, int width)
public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)
}
Notice that each body is given the iteration index and is expected to launch asynchronous work and return a Task. The parameters that these methods take are probably obvious. Well, except for the last one. The "width" indicates how many outstanding asynchronous bodies should be in flight at once. The Task returned by For and While won't be considered done until all iterations are done, and any exceptions will be propagated as you might hope. It would be pretty useless otherwise.
For example, we could write a while loop that does something very silly:
TaskControlFlow.While(
i => i < 100,
i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },
4
).Wait();
Each body just returns a "timer task" that completes after 250ms and then prints the iteration index to the console. We pass a width of 4, so only four tasks will be outstanding at any given time. Notice we call Wait at the end, since both For and While return tasks representing the in flight work. This could have instead been written using a For loop as follows:
TaskControlFlow.For(0, 100,
i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },
4
).Wait();
The CreateTimerTask method, by the way, looks like this:
private static Task CreateTimerTask(int ms)
{
var tcs = new TaskCompletionSource<bool>();
new Timer(x => ((TaskCompletionSource<bool>)x).SetResult(true), tcs, ms, -1);
return tcs.Task;
}
As something more realistic, imagine we wanted to do something with a large number of files, and don't want to block a whole bunch of threads in the process. The following "simple" expression will count up all of the bytes for all of the files in a particular directory, without once blocking the thread -- well, except for the initial call to Directory.GetFiles:
string win = "c:\\...\\";
string[] files = Directory.GetFiles(win);
int total = 0;
TaskControlFlow.For(0, files.Length,
i => {
bool eof = false;
int offset = 0;
byte[] buff = new byte[4096];
FileStream fs = File.OpenRead(files[i]);
return TaskControlFlow.While(
j => !eof,
j => Task<int>.Factory.
FromAsync<byte[],int,int>(
fs.BeginRead, fs.EndRead, buff, 0, buff.Length, // Buffer offset is always 0; the stream tracks the file position.
null, TaskCreationOptions.None
).
ContinueWith(v => {
if (eof = v.Result < buff.Length) {
fs.Close();
}
offset += v.Result;
Interlocked.Add(ref total, v.Result);
}),
1
);
},
8
).Wait();
Console.WriteLine(total);
Pretty neat. We've somewhat arbitrarily chosen a width of 8 for this loop. And notice something very subtle but important here: we've chosen a width of 1 for the inner loop that plows through the bytes of a file. This is because we're sharing state, and it would not be safe to launch numerous iterations at once. The same byte[], eof variable, and so forth, would become corrupt. I will mention in passing that it's unfortunate that we've got that interlocked stuck in there to add to the total. Refactoring this so that we could just do a LINQ reduce over the whole thing would be nice. Indeed, it can be done.
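One possible shape for that refactoring, offered as an assumption rather than as the way it was actually done: have each file yield a Task<int> holding its own byte count, then fold the counts together once they have all completed. ReduceSketch and SumAll are hypothetical names, and the per-file tasks here are simple stand-ins.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ReduceSketch
{
    // Hypothetical helper: folds per-item results into one total once all are done.
    // ContinueWhenAll rejects an empty array, so callers must pass at least one task.
    public static Task<int> SumAll(Task<int>[] parts)
    {
        return Task.Factory.ContinueWhenAll<int, int>(
            parts,
            done => done.Sum(t => t.Result));
    }

    static void Main()
    {
        // Stand-ins for per-file byte counts; real ones would come from the read loops.
        Task<int>[] parts = new[] { 4096, 1024, 10 }
            .Select(n => Task<int>.Factory.StartNew(() => n))
            .ToArray();

        Console.WriteLine(SumAll(parts).Result);
    }
}
```

With this arrangement no shared total is mutated, so the Interlocked.Add disappears; the cost is one extra task per file.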
We can do away with the For implementation very quickly. It is just implemented in terms of While:
public static Task For(int from, int to, Func<int, Task> body, int width)
{
// While counts iterations from 0, so offset the index that the body sees as well.
return While(i => from + i < to, i => body(from + i), width);
}
And it turns out that the While implementation is not terribly complicated either. Here it is:
public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)
{
var tcs = new TaskCompletionSource<bool>();
int currIx = -1; // Current shared index.
int currCount = width; // The number of outstanding tasks.
int canceled = 0; // 1 if at least one body was cancelled.
ConcurrentBag<Exception> exceptions = null; // A collection of exceptions, if any.
// Generate a continuation action: this fires for each body that completes.
Action<Task> fcont = null;
fcont = tsk => {
if (tsk.IsFaulted) {
// Accumulate exceptions.
LazyInitializer.EnsureInitialized(ref exceptions);
foreach (Exception inner in tsk.Exception.InnerExceptions) {
exceptions.Add(inner);
}
}
else if (tsk.IsCanceled) {
// Mark that cancellation has occurred.
canceled = 1;
}
else if (canceled == 0 && exceptions == null) {
// If no cancellations / exceptions are found, attempt to kick off more work.
int ix = Interlocked.Increment(ref currIx);
if (condition(ix)) {
// Generate a new body task, handling exceptions. Then make sure we
// tack on the continuation on that new task, so we can keep on going...
// If the condition yielded 'false', we'll simply fall through and try to finish.
Task btsk;
try {
btsk = body(ix);
}
catch (Exception ex) {
btsk = AlreadyFaulted(ex);
}
btsk.ContinueWith(fcont);
return;
}
}
// If this is the last task, signal completion.
if (Interlocked.Decrement(ref currCount) == 0) {
if (exceptions != null) {
tcs.SetException(exceptions);
}
else if (canceled == 1) {
tcs.SetCanceled();
}
else {
tcs.SetResult(true);
}
}
};
// Fire off the right number of starting tasks.
for (int i = 0; i < width; i++) {
AlreadyDone.ContinueWith(fcont);
}
return tcs.Task;
}
I've commented the code inline to illustrate what is going on. The only other parts not shown are the AlreadyDone and AlreadyFaulted members, which simply give Tasks that are already in a final state. These aren't strictly necessary, but they come in handy in a number of situations:
internal static Task AlreadyDone;
static TaskControlFlow()
{
var tcs = new TaskCompletionSource<bool>();
tcs.SetResult(true);
AlreadyDone = tcs.Task;
}
private static Task AlreadyFaulted(Exception ex)
{
var tcs = new TaskCompletionSource<bool>();
tcs.SetException(ex);
return tcs.Task;
}
And that's it. I'm done for now. Hope you enjoyed it. I've got a few other posts in the works -- primarily the result of a day full of hacking (I got in the office at 7am this morning, and have been here ever since, 14 hours later) -- demonstrating how to do speculative asynchronous work for if/else branches. Finally, I also have a neat example that illustrates how to do deep dataflow-based speculation without having to wait for work to complete. This combines the new .NET 4.0 dynamic capabilities with parallelism, so I'm pretty excited to get it working and write about it.
 Monday, October 19, 2009
Embarrassingly, I neglected to write about the oldest trick in the book in my last post: designing the producer/consumer data structure to reduce false sharing. As I've written about several times previously (e.g. see here), and more so in the book, false sharing is always deadly and must be avoided.
As a simple example, consider a program that merely increments a shared counter over and over again. Say we give P threads their own separate counters, and ask each of them to increment its counter an equal number of times. Each thread can of course do this without synchronization, because the counters are distinct: no locks or even interlocked operations are necessary. Naively, one might expect that running P of them in parallel leads to no interference, and hence perfect parallelization. However, when I run a little benchmark on my 8-way machine, the timings (elapsed Stopwatch ticks, so lower is better) for increasing values of P tell a very different story:
1 = 22425789
2 = 42023726 (187%)
4 = 175828522 (784%)
8 = 333906288 (1489%)
It is clear that the throughput drops dramatically as P increases. The reason? Each counter, being only 8 bytes wide, shares a cache line with as many as 7 other counters -- or 15 if we're on a machine with 128 byte cache lines. A simple change to the counter's layout, so that individual counters do not share the same cache line, will remedy the situation. The numbers improve dramatically. In fact, they remain constant no matter the value of P:
1 = 21914250
2 = 21900392 (100%)
4 = 21865781 (100%)
8 = 21934008 (100%)
This perfect scaling isn't always possible due to memory bandwidth, but because we're just incrementing a single counter per core this doesn't manifest as a problem.
For what it's worth, the machine I am running these tests on is an 8-way, dual-socket, quad-core. Pairs of cores share an L1 cache, and all cores in a socket share an L2 cache. So the pairs {0,1}, {2,3}, {4,5}, and {6,7} are each expected to have distinct L1 caches and the groups {0,1,2,3} and {4,5,6,7} are expected to have distinct L2 caches. The P = 2 number above is run with two threads affinitized such that they share the same L1 cache. If we force them apart, however, we get slightly different results:
2 = 42023726 (187%) -- same L1 cache
2 = 54706505 (244%) -- same L2 cache
2 = 75030977 (335%) -- separate sockets
As expected, the more distance in the cache hierarchy, the greater the slowdown due to the increased ping pong paths.
The specific results are of course unique to my machine, but nevertheless the conclusion is clear: reducing sharing leads to substantial performance gains, particularly with large numbers of threads hammering on the shared lines. Often more so than eliminating other sources of wasted cycles, like interlocked operations. Eliminating those sources is clearly important too, but it really is amazing how deadly and yet difficult to discover false sharing can be: few cases are as obvious as this one.
One aside is worth mentioning before winding down. When I first ran this experiment, I had done it two ways: (1) with fields of a shared object, then using StructLayout(LayoutKind.Explicit) to keep fields apart, and (2) with counters crammed into an array, which then contains padding elements to eliminate the false sharing. The former is shown above. If you try the latter, you may be surprised. The layout of arrays on the CLR is such that an array's length resides before the first element. So unless you pad the first element of the array, all accesses will perform bounds checking that touches the first element's line. Because this line is being mutated by the thread incrementing the first counter, terrible false sharing results. To solve this, we must pad the first element too.
For example, here are the array numbers with false sharing:
1 = 27366202
2 = 125264714 (458%)
4 = 1383953372 (7969%)
8 = 3136996731 (11463%)
Notice the P = 8 case is over 100x slower! Yowzas. After fixing things, with the first element padded, we again observe perfect scaling:
1 = 27393869
2 = 27465999 (100%)
4 = 27370901 (100%)
8 = 27408631 (100%)
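The padded-array fix amounts to a little index arithmetic. The sketch below isolates it; PaddedCounters and SlotIndex are hypothetical names, and the stride of 16 longs (128 bytes) is an assumed upper bound on the cache line size rather than something queried from the hardware.

```csharp
using System;

// Hypothetical helper mirroring the padded-array layout described above.
class PaddedCounters
{
    // 16 longs * 8 bytes = 128 bytes, an assumed upper bound on the cache line size.
    public const int Stride = 16;

    private readonly long[] m_slots;

    public PaddedCounters(int count)
    {
        // One extra stride up front keeps counter 0 away from the array's length field.
        m_slots = new long[(count + 1) * Stride];
    }

    public static int SlotIndex(int counter)
    {
        return (counter + 1) * Stride;
    }

    public void Increment(int counter) { m_slots[SlotIndex(counter)]++; }

    public long Read(int counter) { return m_slots[SlotIndex(counter)]; }

    static void Main()
    {
        var counters = new PaddedCounters(4);
        counters.Increment(0);
        counters.Increment(3);
        counters.Increment(3);

        Console.WriteLine(SlotIndex(0) * sizeof(long));                  // Byte offset of counter 0.
        Console.WriteLine((SlotIndex(1) - SlotIndex(0)) * sizeof(long)); // Bytes between neighbors.
        Console.WriteLine(counters.Read(3));
    }
}
```

The price is memory: each 8-byte counter now occupies 128 bytes, plus one more stride of padding at the front.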
Clearly false sharing is not merely a theoretical concern. In fact, during our Beta1 performance milestone in Parallel Extensions, most of our performance problems came down to stamping out false sharing in numerous places: the partitioning logic of parallel for loops, polling cancellation token flags, enumerators allocated at the beginning of a PLINQ query and constantly mutated during its execution, and even in our examples (e.g. see Herb's matrix multiplication example), etc. It is terribly simple to make a mistake and, in a complicated system, terribly difficult to pinpoint the origin of what can be a truly crippling scalability bottleneck.
In the next post, we will go back and take a look at our single-producer / single-consumer buffer, and redesign it to have substantially better cache behavior.
~
For reference, here's the basic program used for a lot of these tests:
//#define CACHE_FRIENDLY
//#define USE_ARRAY
#pragma warning disable 0169
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;
class Program
{
const int P = 1;
#if USE_ARRAY
class Counters
{
long[] m_longs;
internal Counters(int n) {
#if CACHE_FRIENDLY
m_longs = new long[(n+1)*16];
#else
m_longs = new long[n];
#endif
}
public void Increment(int i) {
#if CACHE_FRIENDLY
m_longs[(i+1)*16]++;
#else
m_longs[i]++;
#endif
}
}
#else // USE_ARRAY
#if CACHE_FRIENDLY
[StructLayout(LayoutKind.Explicit)]
#endif
struct Counters
{
#if CACHE_FRIENDLY
[FieldOffset(0)]
#endif
public long a;
#if CACHE_FRIENDLY
[FieldOffset(128)]
#endif
public long b;
#if CACHE_FRIENDLY
[FieldOffset(256)]
#endif
public long c;
#if CACHE_FRIENDLY
[FieldOffset(384)]
#endif
public long d;
#if CACHE_FRIENDLY
[FieldOffset(512)]
#endif
public long e;
#if CACHE_FRIENDLY
[FieldOffset(640)]
#endif
public long f;
#if CACHE_FRIENDLY
[FieldOffset(768)]
#endif
public long g;
#if CACHE_FRIENDLY
[FieldOffset(896)]
#endif
public long h;
}
static Counters s_c = new Counters();
#endif // USE_ARRAY
public static void Main(string[] args)
{
int p = int.Parse(args[0]);
const int iterations = int.MaxValue / 4;
ManualResetEvent mre = new ManualResetEvent(false);
#if USE_ARRAY
Counters c = new Counters(p);
#endif
Thread[] ts = new Thread[p];
for (int i = 0; i < ts.Length; i++) {
int tid = i;
ts[i] = new Thread(delegate() {
SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u << tid));
mre.WaitOne();
for (int j = 0; j < iterations; j++)
#if USE_ARRAY
c.Increment(tid);
#else
switch (tid) {
case 0: s_c.a++; break;
case 1: s_c.b++; break;
case 2: s_c.c++; break;
case 3: s_c.d++; break;
case 4: s_c.e++; break;
case 5: s_c.f++; break;
case 6: s_c.g++; break;
case 7: s_c.h++; break;
}
#endif
});
ts[i].Start();
}
Stopwatch sw = Stopwatch.StartNew();
mre.Set();
foreach (Thread t in ts) t.Join();
Console.WriteLine(sw.ElapsedTicks);
}
[System.Runtime.InteropServices.DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();
[System.Runtime.InteropServices.DllImport("kernel32.dll")]
static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);
}
 Sunday, October 04, 2009
Commonly two threads must communicate with one another, typically to exchange some piece of information. This arises in low-level shared memory synchronization as in PLINQ’s asynchronous data merging, in the implementation of higher level patterns like message passing, inter-process communication, and in countless other situations. If only two agents partake in this arrangement, however, it is possible to implement a highly efficient exchange protocol. Although the situation is rather special, exploiting this opportunity can lead to some interesting performance benefits.
The standard technique for shared-memory situations is to use a ring buffer. This buffer is ordinarily an array of fixed length that may become full or empty. The two threads in this arrangement assume the role of producer and consumer: the producer adds data to the buffer and may make it full, whereas the consumer removes data from the buffer and may make it empty. It is possible to generalize this to multi-consumers or multi-producers, with some added cost to synchronization. What is described below is for the two thread case.
We will call this a ProducerConsumerRendezvousPoint<T>, and its basic structure looks like this:
using System;
using System.Threading;
public class ProducerConsumerRendezvousPoint<T>
{
private T[] m_buffer;
private volatile int m_consumerIndex;
private volatile int m_consumerWaiting;
private AutoResetEvent m_consumerEvent;
private volatile int m_producerIndex;
private volatile int m_producerWaiting;
private AutoResetEvent m_producerEvent;
public ProducerConsumerRendezvousPoint(int capacity)
{
if (capacity < 2) throw new ArgumentOutOfRangeException("capacity");
m_buffer = new T[capacity];
m_consumerEvent = new AutoResetEvent(false);
m_producerEvent = new AutoResetEvent(false);
}
private int Capacity
{
get { return m_buffer.Length; }
}
private bool IsEmpty
{
get { return (m_consumerIndex == m_producerIndex); }
}
private bool IsFull
{
get { return (((m_producerIndex + 1) % Capacity) == m_consumerIndex); }
}
public void Enqueue(T value)
{
...
}
public T Dequeue()
{
...
}
}
There are some basic invariants to call out:
- Our buffer holds our elements, producer index says at what position the next element enqueued will be stored, and the consumer index says from what position the next request to dequeue an element will retrieve its value.
- We reserve one element in our buffer to differentiate between fullness and emptiness. This is why we demand that capacity be >= 2. We could alternatively know how to distinguish between a free slot and a used one, such as checking for null, but keep things simple for now.
- Thus, IsEmpty is true when the consumer and producer indexes are the same, whereas IsFull is true when the consumer is one ahead of the producer, such that producing would make the producer index collide with the consumer index (otherwise leading to IsEmpty).
- It should be obvious that our intent is to block consumption when IsEmpty == true and production when IsFull == true. This is the point of the waiting flags and events.
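To make the reserved-slot invariant concrete, here is a minimal single-threaded sketch of the same index arithmetic. RingIndexDemo and UsableSlots are hypothetical names introduced purely for illustration; they are not part of the class above.

```csharp
using System;

// Hypothetical helper, for illustration only: it mirrors the IsEmpty / IsFull
// arithmetic of the rendezvous class, without any threading.
public static class RingIndexDemo
{
    public static bool IsEmpty(int producerIndex, int consumerIndex)
    {
        return producerIndex == consumerIndex;
    }

    public static bool IsFull(int producerIndex, int consumerIndex, int capacity)
    {
        return ((producerIndex + 1) % capacity) == consumerIndex;
    }

    // Counts how many elements can be enqueued into an empty buffer
    // before IsFull becomes true.
    public static int UsableSlots(int capacity)
    {
        int producer = 0, consumer = 0, count = 0;
        while (!IsFull(producer, consumer, capacity))
        {
            producer = (producer + 1) % capacity;
            count++;
        }
        return count;
    }
}
```

With a capacity of 4, UsableSlots returns 3, and with the minimum capacity of 2 it returns 1: one slot is always sacrificed so that full and empty remain distinguishable.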
Now let us look at the implementation first of Enqueue and then Dequeue, paying special attention to the necessary synchronization operations. They look nearly identical:
public void Enqueue(T value)
{
if (IsFull) {
WaitUntilNonFull();
}
m_buffer[m_producerIndex] = value;
Interlocked.Exchange(ref m_producerIndex, (m_producerIndex + 1) % Capacity);
if (m_consumerWaiting == 1) {
m_consumerEvent.Set();
}
}
Enqueue begins, as expected, by checking whether the queue is full. Notice that we have not issued any memory fences yet. The only thread that may make the buffer full is the current one, which will obviously not occur before proceeding, and therefore we needn’t perform any expensive synchronization operation for this check. The value seen may of course be stale but we can deal with that possibility inside the slow path, WaitUntilNonFull. We’ll look at that momentarily.
We then proceed to placing the value in the buffer at the current producer’s index. Only the current thread will update the producer index and a consumer will not read from the current value so long as the producer index refers to it. The value may not even be written atomically, e.g. for T’s that are greater than a pointer sized word. This is okay: only the act of incrementing the index allows a consumer to access the element in question. Writes on the CLR 2.0 memory model are retired in order and the reading side will use an acquire load of the index before accessing the element’s words. Indeed we could use complicated multipart value types that are comprised of lengthy buffers, header words, and so on.
We then increment the producer index, handling the possibility of wrap-around by modding with the capacity. This uses an Interlocked.Exchange for one simple reason: we are about to read a consumer waiting flag, and must prevent the load of that flag from moving prior to the producer index write. The consumer sets this flag when it notices the queue is empty and waits. This enables us to use a “Dekker style” check to minimize synchronization. We could have alternatively just unconditionally set the event, doing away with the interlocked operation altogether. But that call, if it involves kernel transitions, which is quite likely, is going to be much more expensive and would occur on every call to Enqueue. And any event of this kind that doesn’t require kernel transitions is going to at least require an interlocked operation for the same reason we require one here. An alternative technique involves setting when we transition the buffer from empty to non-empty or full to non-full, but this wastes a possibly expensive signal if the other party isn't even currently waiting. If full or empty is a rare situation, then full or empty and with a peer actually physically waiting is even rarer.
Let’s now look at the WaitUntilNonFull method. It’s really the reverse of what the consumer does, so based on everything said till this point, I am guessing it’s obvious:
private void WaitUntilNonFull()
{
Interlocked.Exchange(ref m_producerWaiting, 1);
try {
while (IsFull) {
m_producerEvent.WaitOne();
}
}
finally {
m_producerWaiting = 0;
}
}
We begin by issuing a memory fence and setting the producer waiting flag. This memory fence is necessary to advertise that we are about to wait, and also to ensure the subsequent check of IsFull is synchronized. The consumer does something very much like the producer does (above) after taking an element: if the producer is waiting, the consumer has made space for it and therefore must signal. But it could be the case that the consumer has already made the queue non-full before it could notice the producer’s waiting flag. We catch this by ensuring the producer’s check of IsFull cannot go before setting the producer waiting; similarly, the consumer cannot make IsFull false without subsequently noticing that the producer is waiting; this avoids deadlock.
Everything else is self explanatory. Well, almost. We need a loop here to catch one subtle situation. Imagine a producer enters into this method thinking the buffer is full. It sets its flag, and then immediately notices that the buffer is not full anymore: a consumer has removed an item and freed up a slot. But imagine that consumer noticed that the producer was waiting and hence set its event. This is an auto-reset event, so the next time the producer must wait, the event will have already been set and it’ll likely wake up immediately, before IsFull has become false. An alternative way of dealing with this is to call Reset on the event if we didn’t actually wait on the event, but again we keep things simple.
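The stale-signal behavior that motivates the loop can be seen in isolation. The following is a small sketch, with StaleSignalDemo being a hypothetical name used only here: an AutoResetEvent that was set while nobody was waiting satisfies the next wait immediately, regardless of whether the condition the waiter cares about actually holds.

```csharp
using System;
using System.Threading;

// Hypothetical demo type, for illustration only.
public static class StaleSignalDemo
{
    // True if a wait is satisfied by a Set that happened before the wait began.
    public static bool WakesOnStaleSignal()
    {
        using (var evt = new AutoResetEvent(false))
        {
            evt.Set();              // signal arrives while nobody is waiting
            return evt.WaitOne(0);  // a later wait returns immediately
        }
    }

    // True if, after the stale signal has been consumed, a further wait would block.
    public static bool SecondWaitWouldBlock()
    {
        using (var evt = new AutoResetEvent(false))
        {
            evt.Set();
            evt.WaitOne(0);          // consumes the stale signal; the event auto-resets
            return !evt.WaitOne(0);  // nothing left, so this wait times out
        }
    }
}
```

Re-checking the condition in a while loop, as WaitUntilNonFull does, simply absorbs such a wake-up and goes back to waiting.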
At this point, the consumer side is going to look very familiar:
public T Dequeue()
{
if (IsEmpty) {
WaitUntilNonEmpty();
}
T value = m_buffer[m_consumerIndex];
m_buffer[m_consumerIndex] = default(T);
Interlocked.Exchange(ref m_consumerIndex, (m_consumerIndex + 1) % Capacity);
if (m_producerWaiting == 1) {
m_producerEvent.Set();
}
return value;
}
private void WaitUntilNonEmpty() {
Interlocked.Exchange(ref m_consumerWaiting, 1);
try {
while (IsEmpty) {
m_consumerEvent.WaitOne();
}
}
finally {
m_consumerWaiting = 0;
}
}
This is near-identical to Enqueue and WaitUntilNonFull, and so needs little explanation. The acquire load inside IsEmpty of the producer index ensures that we observe the producer index for this particular value being beyond the current consumer’s index before loading the value itself, thereby ensuring we see the whole set of written words. The one other thing to point out is that we “null out” the element consumed which, for large buffers, helps to avoid space leaks that would have otherwise been possible.
There are certainly some opportunities for improving this.
For example, we might add a little bit of spinning in the wait cases. This would be worthwhile for cases that exchange data at very fast rates and have small buffers, meaning that the chance of hitting empty and full conditions is quite high. Avoiding the context switch thrashing is likely to lead to hotter caches, because threads will remain runnable for longer, and it also saves the raw cost of the switches themselves.
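As a rough sketch of what spin-then-block could look like, the helper below polls a condition using SpinWait and only falls back to the event once SpinWait reports that its next spin would yield. SpinThenBlock and its Wait method are hypothetical names, not the variant that was benchmarked, and the caller is assumed to have already published its waiting flag (as WaitUntilNonFull does) before calling.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class SpinThenBlock
{
    // Returns true if the condition was observed without blocking,
    // false if we had to fall back to waiting on the handle.
    public static bool Wait(Func<bool> condition, WaitHandle evt)
    {
        if (condition()) return true;

        // Spin phase: cheap busy-waiting while SpinWait still considers it worthwhile.
        SpinWait spinner = new SpinWait();
        while (!spinner.NextSpinWillYield)
        {
            spinner.SpinOnce();
            if (condition()) return true;
        }

        // Blocking phase: same loop shape as WaitUntilNonFull / WaitUntilNonEmpty.
        while (!condition())
        {
            evt.WaitOne();
        }
        return false;
    }
}
```

On a single-processor machine NextSpinWillYield is true from the start, so the spin phase is skipped entirely, which is the behavior one would want there anyway.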
Additionally, we technically could use a single event if we wanted to spend the effort. We’d have to handle a few tricky cases, however: namely, the case where a producer or consumer ends up waiting on an event because it “just missed” the event of interest, thus satisfying the event. Indeed both threads could actually end up waiting on the event simultaneously and we need to somehow ensure the right one eventually gets awakened. This leads to some chatter and probably isn’t worth the added complexity.
Here is a peek at some rough numbers from a little benchmark that has two threads enqueuing and dequeuing elements as fast as humanly (or computerly) possible. This is a particularly unique and unlikely situation, but stresses the implementation in a few interesting ways. In particular, it will stress the contentious slow paths; although these are expected to be rarer, the fast paths are just so easy to get right in this data structure that they are mostly uninteresting to stress performance-wise. There are then a few variants, each based on the original version shown above:
- 2 element capacity, which means we’ll be transitioning from empty to full and back a lot.
- 1024 element capacity, which means we won’t.
- With spinning, using .NET 4.0’s new System.Threading.SpinWait type.
- An implementation that overuses interlocked operations as many naïve programmers would do.
The 2 element capacity situation is common in some message passing systems, e.g. Ada rendezvous, Comega joins, and the like. Whereas the 1,024 element capacity situation is common for more general purpose channels, where some amount of pipelining is anticipated.
I whipped together a benchmark -- so quickly that we can barely trust it, I might add -- to measure these things. Here’s a small table, showing the observed relative costs:
                   2 capacity   1024 capacity
As-is   No-spin    100.00%      1.93%
As-is   Spin        56.41%      1.66%
Naïve   No-spin    101.20%      2.09%
Naïve   Spin        67.73%      1.87%
As with most microbenchmarks, take the results with a grain of salt. And there are certainly more interesting variants to compare this against, including a monitor-based implementation that locks around access to the buffer itself. Nevertheless, we can draw a few conclusions from this: as expected, the version that uses a single interlocked on enqueue and single interlocked on dequeue is faster than the naïve version that uses multiple (surprise!); spinning makes a much more interesting difference on the 2 element capacity situation, as expected, because it reduces the number of context switches dramatically; and, finally, the larger capacity enables a producer to race ahead of the consumer, hence incurring far fewer transitions from full to empty to full and so forth.
This post was more of a case study than anything else. There is nothing conclusive or groundbreaking here, and I suppose I should have said that would be the case up front. That said, I’ve seen this technique used in over a dozen situations in actual product code now, so I figured I’d write a little about it, with a focus on how to minimize the synchronization operations. We even contemplated shipping such a type in Parallel Extensions to .NET, but it’s just too darn specialized to warrant it. So the closest thing we provided is BlockingCollection<T>. Enjoy.
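For comparison, here is a minimal sketch of the same two-thread, bounded hand-off expressed with BlockingCollection<T>. BlockingCollectionDemo and SumThroughBoundedChannel are hypothetical names used for illustration; the BlockingCollection<T> members used (the bounded-capacity constructor, Add, CompleteAdding, and GetConsumingEnumerable) are part of .NET 4.0.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical demo type, for illustration only.
public static class BlockingCollectionDemo
{
    // Pushes 1..itemCount through a bounded collection and returns the
    // sum computed by a single consumer thread.
    public static int SumThroughBoundedChannel(int itemCount, int capacity)
    {
        int sum = 0;
        using (var channel = new BlockingCollection<int>(capacity))
        {
            var consumer = new Thread(() =>
            {
                // Blocks while empty; ends once adding is complete and the collection drains.
                foreach (int item in channel.GetConsumingEnumerable())
                    sum += item;
            });
            consumer.Start();

            for (int i = 1; i <= itemCount; i++)
                channel.Add(i);      // blocks while the collection is at capacity

            channel.CompleteAdding();
            consumer.Join();         // after Join, the consumer's writes to sum are visible
        }
        return sum;
    }
}
```

This is more general than the rendezvous buffer, since it supports any number of producers and consumers, and pays for that generality with more synchronization per operation.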
 Monday, September 28, 2009
I've officially started down the long road of writing a 2nd edition of Concurrent Programming on Windows, and would like your help.
There are many great new features in Windows 7 and the next versions of .NET, Visual C++ / CRT, and Visual Studio. The book will of course cover them all.
But I am also looking to reshape the 1st edition in many dimensions. I'd like to focus on readability, conciseness, and clearly separating the "must know" topics from the more geeky and advanced ones. This is a common conundrum when writing a technical book. The advanced topics are more likely to appeal to readers of my blog, for instance, but may be daunting for newcomers to concurrency. Tradeoffs abound. Nevertheless the 2nd edition is likely to be slimmed down compared to the 1st.
Any and all feedback, suggestions, and ideas are welcome. What did you like about the 1st edition, and what did you not like? If you could change a handful of things, what would make the top of your list? And was it missing something crucial that you would like to see covered? Please send your feedback to joe AT@ acm DOT org, or simply leave comments here on the blog. Regardless of whether you've read the 1st edition or not.
I sincerely look forward to hearing from you. Cheers.
 Monday, July 27, 2009
I had originally entitled this post "Having your concurrency cake and eating it too", but it sounded a little too silly.
I have grown convinced over the past few years that taming side effects in our programming languages is a necessary prerequisite to attaining ubiquitous parallelism nirvana. Although I am continuously exploring the ways in which we can accomplish this -- ranging from shared nothing isolation, to purely functional programming, and anything and everything in between -- what I wonder the most about is whether the development ecosystem at large is ready and willing for such a change.
It is this that I find the most frightening. I know we can give the world Haskell, or Erlang, or simple incremental steps within familiar environments, like Parallel Extensions. (Indeed, the world already has these things.) But elevating effects to a first class concern in day-to-day programming turns out to be a tough pill to swallow. Particularly since the incremental degrees of parallelism that this switch will unlock are questionable (see this and this); and even if they were pervasive and impressive, it's unclear what percentage of developers will pay what specific price for a 2x, 4x, or even 16x increase in compute performance. It sounds great on paper, but the cost / benefit equation is a complicated one.
"Pay for play" is the standard terminology we use for such things around here, and the solution needs to have the right amount of it.
Many folks with embarrassingly parallel algorithms will succeed just fine in a shared memory + locks + condition variables world, and indeed have already begun to do so. And specialized tools -- like GPGPU programming -- have popped up that, when small kernels of computations are written in a highly constrained way, will parallelize, sometimes impressively. Is this enough? Perhaps for the next 5 years, but surely not much longer after that. It is in my opinion qualitatively very important for the future of computer science that we provide programming environments that are more conducive to safe and automatic parallelism. And yet I cannot stand up with a straight face and proclaim that each and every developer on the face of the planet should practice side effect abstinence. A healthy balance between cognitive familiarity and pragmatic [r]evolution must be found. Many promising approaches are in the works (see UIUC's DPJ), but we are years away.
Until then, parallelism on broadly deployed commercial platforms will likely remain in the realm of specialists.
Of course, Haskell and Erlang both accomplish the no effects feat in a sneaky way. For those interested in foisting parallelism unto the masses, lessons can be learned from these communities. If you buy into purely functional programming, you necessarily buy into programming without effects, and the (sparing) use of monads to represent them. (Or, as my colleague Erik calls it, fundamentalist functional programming.) And if you buy into large scale message passing, you (typically) necessarily also buy into programming without shared memory, leaving behind only strongly isolated effects. The key here is that developers gain many other benefits by switching to these platforms -- and the lack of effects is admittedly a consequential byproduct of this switch. The lack of effects is not center stage. The two approaches have recently begun to converge in what I believe to be the appropriate long-term approach: strong isolation with effects within, and safe, deterministic data parallelism through careful control over sharing, aliasing, and heap separation.
That said, though not center stage, the switch to effectless programming is certainly not painless.
Enabling side effects among otherwise functional code, I think, is a good thing, because it allows familiar algorithms to be encoded in an ordinary imperative way. Familiarity is key: it may sound two faced, but I don't think parallelism is sufficiently top of mind that developers will want to completely rearrange the way that they write software. Perhaps we will evolve in this direction, but a significant leap will fall flat. Moreover, many algorithms actually depend on stateful updates to achieve adequate performance, like write in place graphics buffers. The Haskell state monad, when coupled with the do notation, strikes a nice balance by embedding imperative-looking effects within a strictly functional language.
Furthermore, I really respect that Haskell discourages cheating. (Any unsafePerformIO is viewed with great suspicion.) I quite like mostly-functional programming languages like ML and Scheme, because they tend to be easier on programmers with C backgrounds, but strongly dislike that a mutation can lurk within what appears to be an otherwise pure function. Documenting side effects in the type system is healthy and allows better symbolic reasoning about the dependencies and implicit parallelism contained within, transitively, while still providing a way to get at effectful programming. Haskell does a great job at this. The elimination of dependence ought to be the focus of programmers, and not the elimination of ad-hoc and unstructured access to shared, mutable state. These are algorithmic and important concerns.
What remains unclear is where the boundaries lie. Part and parcel of documenting effects is thinking about them when designing your software. You need to consider whether IList<T>'s Contains method may mutate the list or not, for example, and hold the line on implementations of the interface. Either it returns an 'a' or an 'IO a' -- and this decision is one that has far reaching implications. This is a wholly separate kind of interface contract than what most programmers are accustomed to having to think about during the code-debug-edit cycle. And surely Python and JavaScript developers will not care one way or the other, particularly if it forces more design decisions up front than what is customary today. This bifurcation seems inevitable, and yet there is substantial crossover: C# developers will write Python scripts, and Python developers will consume components written in C#.
And yet, I think we need to venture down this path in order to achieve automatically scalable software. Parallel computers have become incredibly cheap, and so the historical barriers into high performance technical computing have been whittled away to the software skills necessary to write scalable programs; we will likely succeed at expanding this market without radical changes, but if we stopped there, vast reams of client-side software will be left in the dust. I've been making inroads into solving the problem on my end, with a new language that sits between C# and Haskell. I'm biased, have been hard at work on this problem for many years, and yet still struggle to answer these fundamental questions. I am a big believer that there's got to be a happy medium out there. But I'm still very perplexed, and face some very high walls to hurdle. Who will discover the right balance, and when will they do so?
 Monday, July 13, 2009
In this blog post, I'll demonstrate building some very simple (but nice!) synchronization abstractions: a Lock type and a standalone ConditionVariable class. And we'll use a few new types in .NET 4.0 in the process. I had to implement a condition variable recently -- the joys of developing a new operating system / platform from the ground up -- and decided to put together a toy example for a blog post as I went. Warning: this is for educational purposes only.
Not to sound like a broken record, but it is a very good idea to manage locks intentionally. Doing so makes synchronization code easier to write, understand, and, correspondingly, maintain; given the difficult nature of concurrency, any opportunity for simplification is always welcomed. Yes, that means avoiding the CLR's dreadful capability to lock on arbitrary objects. (Which, by the way, is effectively just a holdover from the days where .NET was trying to woo developers from Java onto the platform.) In retrospect, this ability was a bad idea, and we should have provided and embellished a System.Threading.Lock class from Day One.
Well, rewind the clock and imagine we had provided such a Lock class. In fact, here's an overly simple one right here. I'm going to cheat a little, and reuse two locks that come with .NET 4.0: Monitor itself, and the new SpinLock class:
//#define SPIN_LOCK
public class Lock
{
#if SPIN_LOCK
private SpinLock m_slock = new SpinLock();
#else
private object m_slock = new object();
#endif
private ThreadLocal<int> m_acquireCount = new ThreadLocal<int>();
public void Enter() {
#if SPIN_LOCK
bool ignoreTaken = false; // SpinLock.Enter requires the flag to be false on entry
m_slock.Enter(ref ignoreTaken);
#else
Monitor.Enter(m_slock);
#endif
m_acquireCount.Value = m_acquireCount.Value + 1;
}
public void Exit() {
m_acquireCount.Value = m_acquireCount.Value - 1;
#if SPIN_LOCK
m_slock.Exit();
#else
Monitor.Exit(m_slock);
#endif
}
public bool IsHeld {
get { return m_acquireCount.Value > 0; }
}
public int RecursionCount {
get { return m_acquireCount.Value; }
}
}
Okay, this is not rocket science. And to be fair, it's missing some critical features, like reliable acquisition (finally available on Monitor in 4.0, and also SpinLock), and lock leveling. But it's a start.
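For reference, the reliable acquisition pattern mentioned above looks roughly like the following sketch. ReliableAcquireDemo, s_gate, and s_counter are hypothetical names; the Monitor.Enter(object, ref bool) overload is the one introduced in .NET 4.0.

```csharp
using System;
using System.Threading;

// Hypothetical demo type, for illustration only.
public static class ReliableAcquireDemo
{
    private static readonly object s_gate = new object();
    private static int s_counter;

    public static int IncrementUnderLock()
    {
        bool lockTaken = false; // must be false on entry
        try
        {
            // lockTaken is set to true if and only if the lock was actually
            // acquired, even if an exception interrupts the call.
            Monitor.Enter(s_gate, ref lockTaken);
            return ++s_counter;
        }
        finally
        {
            if (lockTaken) Monitor.Exit(s_gate);
        }
    }
}
```

A fuller Lock class could expose the same shape, for example an Enter(ref bool taken) overload, so that callers can write the identical try / finally pattern against it.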
Once we've got such a Lock class, we may want to extend it with 1st class condition variable support. Condition variables are core to the monitor concept, and provide a synchronization point that combines a lock with some condition that may be waited upon and triggered. They help to avoid all the pitfalls of standalone events: mainly missed pulses due to the lack of synchronization involved between producers and consumers.
Furthermore, imagine we allow multiple separate ConditionVariable objects per single Lock object. This is a feature that Monitor doesn't currently support (though Win32 CONDITION_VARIABLEs do). This capability would enable us to, say, create a bounded buffer with a single lock to protect the queue, and two separate condition variables: one for the non-empty condition, and the other for the non-full condition. This simplifies the implementation, and helps to avoid deadlock-prone techniques that result from trying to use multiple separate synchronization objects.
The trick is that the Lock and ConditionVariable class need to be well-integrated. So we will provide a constructor that accepts a Lock object:
public class ConditionVariable
{
private Lock m_slock;
public ConditionVariable(Lock slock) {
if (slock == null)
throw new ArgumentNullException("slock");
m_slock = slock;
}
Once we've got that, there are two basic operations to implement: waiting and pulsing (signaling). To achieve this, we'll give each thread its own ManualResetEventSlim object -- a lightweight event class, new to .NET 4.0. (Ironically, it uses Monitor.Wait and Pulse under the covers.) This event will be stored in an instance of the new .NET 4.0 type, ThreadLocal<T>. (An alternative is to store it in a [ThreadStatic], and reuse the same event across all ConditionVariables. Since we only support waiting on one such condition at a time (currently), there is no reason we can't just have one per thread. This is precisely what the CLR does internally, though it's a shame we can't grab hold of that preexisting event.) In addition to that, we'll need a wait-list, maintained in FIFO order as a Queue<ManualResetEventSlim>:
private Queue<ManualResetEventSlim> m_waiters =
new Queue<ManualResetEventSlim>();
private ThreadLocal<ManualResetEventSlim> m_waitEvent =
new ThreadLocal<ManualResetEventSlim>();
Waiting does pretty much what you'd imagine. The m_slock object doubly acts as protection against concurrent access to the waiters list. So when a Wait call is made, we demand that the lock is held by the calling thread. Subtly, we also demand that it hasn't been recursively acquired, since that would require exiting the lock multiple times. This can lead to desynchronization bugs. Unfortunately, Monitor does this, but is critically broken as a result. Once the validation occurs, Wait simply places the current thread into the wait list, exits the lock, waits to be awakened, and then reacquires the lock before returning. This is pretty much exactly what the CLR Monitor class does internally:
public void Wait() {
int rcount = m_slock.RecursionCount;
if (rcount == 0)
throw new InvalidOperationException("Lock is not held.");
if (rcount > 1)
throw new InvalidOperationException("Lock is held recursively.");
// Lazily initialize our event, if necessary.
ManualResetEventSlim mres = m_waitEvent.Value;
if (mres == null) {
mres = m_waitEvent.Value = new ManualResetEventSlim(false);
}
else {
mres.Reset();
}
m_waiters.Enqueue(mres);
m_slock.Exit();
mres.Wait(); // bugbug: interrupt => desync.
m_slock.Enter();
}
Lastly, we must implement the Pulse and PulseAll methods. For kicks, we'll provide an overload of Pulse -- which normally awakens one waiting thread -- that awakens an arbitrary maximum number of threads. So you could say Pulse(4) to awaken at most 4 threads, for example. These methods are even simpler than Wait: they dequeue events off the wait list, and just set them. This awakens the waiters, as desired:
public void Pulse() {
Pulse(1);
}
public void Pulse(int maxPulses) {
if (!m_slock.IsHeld)
throw new InvalidOperationException("Lock is not held.");
for (int i = 0; i < maxPulses; i++) {
if (m_waiters.Count > 0) {
m_waiters.Dequeue().Set();
}
else {
break;
}
}
}
public void PulseAll() {
Pulse(int.MaxValue);
}
}
(This has the unfortunate side effect of two-step dances. The pulse will awaken threads at the mres.Wait() line in Wait, and they immediately try to call m_slock.Enter() as a result. A priority boost may cause them to preempt the pulsing thread, even though they will just end up waiting. A possible fix to this is to even more tightly integrate the Lock and ConditionVariable classes, by having a "deferred pulse" list attached to the lock. Once it has been completely exited, the Lock's Exit method could drain the deferred pulse list and awaken the threads, thus avoiding the two-step dance.)
As to examples, let's take a quick peek at a blocking / bounded queue. When constructed, a capacity is given. Whenever an Enqueue would cause the buffer's contents to exceed the capacity, the producer is blocked until space is made by a consumer. Whenever a Dequeue is attempted on an empty buffer, the consumer is blocked until an item is produced. Though there are opportunities for optimization, this is encoded straightforwardly as follows:
class BlockingQueue<T>
{
private int m_capacity;
private Queue<T> m_q;
private Lock m_qLock;
private ConditionVariable m_qNonFullCondition;
private ConditionVariable m_qNonEmptyCondition;
public BlockingQueue(int capacity) {
m_capacity = capacity;
m_q = new Queue<T>();
m_qLock = new Lock();
m_qNonFullCondition = new ConditionVariable(m_qLock);
m_qNonEmptyCondition = new ConditionVariable(m_qLock);
}
public void Enqueue(T item) {
m_qLock.Enter();
while (m_q.Count == m_capacity)
m_qNonFullCondition.Wait();
m_q.Enqueue(item);
m_qNonEmptyCondition.Pulse();
m_qLock.Exit();
}
public T Dequeue() {
T item;
m_qLock.Enter();
while (m_q.Count == 0)
m_qNonEmptyCondition.Wait();
item = m_q.Dequeue();
m_qNonFullCondition.Pulse();
m_qLock.Exit();
return item;
}
}
The naive approach typically uses a single event to signal the non-empty / non-full transitions. The risk of doing this, of course, is that the wrong kind of thread (producer or consumer) will be signaled, depending on chance and wait arrival order. This is ordinarily only a concern for bounded queues of reasonably small sizes, and high degrees of concurrency, but is still an interesting example of why multiple condition variables per lock is useful.
Enjoy!
 Tuesday, June 23, 2009
I wrote this memo over 2 1/2 years ago about what to do with concurrent exceptions in Parallel Extensions to .NET. Since Beta1 is now out, I thought posting it may provide some insight into our design decisions. I've made only a few slight edits (like replacing code- and type-names), but it's mainly in original form. I still agree with much of what I wrote, although I'd definitely write it differently today. And in retrospect, I would have driven harder to get deeper runtime integration. Perhaps in the next release.
~~~
Concurrency and Exceptions October, 2006
Exceptions raised inside of concurrent workers must be dealt with in a deliberate way. Failures can happen concurrently, and yet often the programmer is working with an API that appears to them as though it’s sequential. The basic question is, then, how do we communicate failure?
The problem
Fork/join concurrency, in which a single “master” thread forks and coordinates with N separate parallel workers, is an incredibly common instance of one of these sequential-looking concurrent operations. The same callback is run by many threads at once, and may fail zero, one, or multiple times. The exception propagation problem is inescapable here and comes with a lot of expectations, because the programmer is presented a traditional stack-based function calling interface papered on top of data or task parallelism underneath.
I am faced with the need for a solution to this problem for PLINQ right now and, while I could invent a one-off solution, we owe it to our customers to come up with a common platform-wide approach (or at least ManyCore-wide). Any solution should compose well across the stack, so that somebody invoking a PLINQ query from within their TPL task that was spawned from a thread pool thread yields the expected and consistent result. And I would like for us to reach consensus for both managed and native programming models.
Before moving on, there is one non-goal to call out. Long-running tasks not under the category of fork/join also deserve some attention, because of the ease with which stack traces can be destroyed and the corresponding impact to debugging, but I will ignore them for now. The problem is not new, exists with the IAsyncResult pattern, and PLINQ doesn’t use this sort of singular asynchronous concurrency. These cases can typically be trivially solved using existing mechanisms, like standard exception marshaling.
No errors, one error, many errors
To understand the core of the issue, imagine we have an API ‘void ForAll<T>(Action<T> a, T[] d)’. It takes a delegate and an array, and for every element ‘e’ in ‘d’ invokes the delegate, passing the element, i.e. ‘a(e)’. If multiple processors are available, the implementation of ForAll may use some heuristic to distribute work among several OS threads, for instance by partitioning the array, probably running one partition on the caller’s thread, and finally joining with these threads before returning so that the caller knows that all of the work is complete when the API returns.
ForAll is not fictitious, and is similar to a number of PLINQ APIs: Where, Select, Join, Sort, etc. It is also exposed directly by the TPL runtime’s Parallel class which intelligently forks and joins with workers.
‘a’ is a user-specified delegate and can do just about anything. That includes, of course, throwing an exception. What’s worse, because ‘a’ is run in several threads concurrently, there may be more than one exception thrown. In fact, there are three distinct possibilities:
- No errors: No invocations of ‘a’ throw an exception.
- One error: A single invocation of ‘a’ throws an exception.
- Many errors: Concurrent invocations of ‘a’ on separate threads throw exceptions.
Clearly letting an exception crash whichever thread the problematic ‘a(e)’ happened to be run on is problematic and confusing, if for no other reason than that the IAsyncResult pattern has established precedent. But realistically, the developer would be forced to devise his or her own scheme to marshal the failure back to the calling thread in order for any sort of chance at recovery. They would get it wrong and it would lead to incompatible and poorly composing silos over time. A Byzantine model that fully prohibits exceptions passing fork/join barriers goes against the simple, familiar, and understandable (albeit often deceptively so) model of exceptions.
(That said, marshaling leads to a crappy debugging experience. An already attached debugger will get a break-on-throw notification at the exception on the origin thread, but since we catch, marshal, and (presumably) rethrow, the first and second chances for unhandled exceptions won’t happen until after the exception been marshaled. This breaks the first pass, and by the time the debugger breaks in, or a crash dump is taken, the stack associated with the origin thread is apt to have gone away, been reused for another task (in the case of the thread pool), etc. We generally try to avoid breaking the first pass in the .NET Framework, but do it in plenty of places: the BCL today already contains tons of try { … } catch { /*cleanup */ throw; }-style exception handlers, for example. For this reason I’m not terribly distraught over the implications of doing it ourselves. And sans deeper integration with the exception subsystem – something we ought to consider – there aren’t many reasonable alternatives.)
What makes this problem really bad is that ForAll appears as though it’s synchronous:
void f() {
    // do some stuff
    ForAll(…, …);
    // do some more stuff, ‘ForAll’ is completely done
}
The method call to ForAll itself is synchronous, but of course its internal execution is not. But still, to the developer, the call to this function represents one task, one logical piece of work, regardless of the fact that the implementation uses multiple threads for execution. As higher level APIs are built atop things like ForAll, the low level parallel infrastructure problem becomes a higher level library or application problem. A Sort that is internally parallel must now decide what exception(s) it will tell callers it may throw.
Nondeterministic exception ordering
We assume the ForAll API stops calling ‘a(e)’ on any given thread when it first encounters an exception. That is, each thread just does something like this:
for (int i = start_idx; i < end_idx; i++) {
    a(d[i]);
}
The for loop terminates when any single iteration throws an exception. Imagine our array contains 2048 elements and that ForAll smears the data across 8 threads, partitioning the array into 256-element sized chunks of contiguous elements. So partition 0 gets elements [0…256), partition 1 gets [256…512), …, and partition 7 gets [1792…2048). Now imagine that ‘a’ throws an exception whenever fed a null element, and that every 256th element in ‘d’, starting at element 10, is null. What can a developer reasonably expect to happen?
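To make this setup concrete, the following sketch computes which element each partition would fail on. PartitionExample and its parameter names are hypothetical, and the code only does the index arithmetic; it does not run anything in parallel.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionExample
{
    // For each contiguous partition, returns the index of the first null
    // element it would encounter, where nulls sit at firstNull,
    // firstNull + stride, firstNull + 2 * stride, and so on.
    public static List<int> FirstNullPerPartition(
        int length, int partitions, int firstNull, int stride)
    {
        int chunk = length / partitions;
        List<int> result = new List<int>();
        for (int p = 0; p < partitions; p++)
        {
            int start = p * chunk;
            int end = start + chunk;
            for (int i = start; i < end; i++)
            {
                if (i >= firstNull && (i - firstNull) % stride == 0)
                {
                    result.Add(i); // this partition's loop stops here
                    break;
                }
            }
        }
        return result;
    }
}
```

Calling FirstNullPerPartition(2048, 8, 10, 256) yields the indices 10, 266, 522, 778, 1034, 1290, 1546, and 1802: one failure in each of the eight partitions, so up to eight exceptions can be in flight at once.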
On one hand, if we’re trying to preserve the illusion of sequential execution, we would only want to surface the exception from the 10th element. With a sequential loop, this would have prevented the 266th, 522nd, and so on, elements from even being passed to ‘a’. So we might simply say that the “left most” exception (based on ordinal index) is the one that gets propagated. The obvious problem with this is there are races involved: subsequent iterations indeed may have actually run. Alternatively, we might consider only letting the “first” propagate. Unfortunately, that doesn’t work either, because we can’t necessarily determine, for a set of concurrent exceptions, which got thrown first. Even if they have timestamps, they could occur in parallel at indistinguishably close times. Nor does this really matter, because it feels fundamentally wrong.
The reason is that we can’t simply throw away failures without true recoverability in the system, a la STM. The execution of code leading up to the exception did actually happen, after all, and there could be residual effects. We might be masking a terrible problem by throwing failures away, possibly leading to (more) state corruption and (prolonged, perhaps unrecoverable) damage. What if the 10th element was a simple ArgumentNullException that the caller chooses to tolerate, but the 266th element’s exception was in response to a catastrophic error from which the application can’t recover? We can’t choose to propagate the 10th but swallow the 266th. Broadly accepted exceptions best practices suggest that app and library devs never catch and swallow exceptions they cannot reasonably handle. We should do our best to follow the spirit of this guidance too.
Re-propagation
We could employ an approach similar to the IAsyncResult pattern, with some slight tweaks.
If each concurrent copy of ForAll caught any unhandled exceptions and marshaled them to the forking thread, including any exceptions that happen on the forking thread itself, we could then propagate all of them together after the join completes. The question is then: what exactly do we propagate?
If there is just a single exception, it’s tempting to just rethrow it. But I don’t believe this is a good approach for two primary reasons:
- This will destroy the stack trace of the original exception. This means no information about the actual source of the error inside ‘a’ is available. With some help from the CLR team, we might be able to get a special type of ‘rethrow’ that copied the original stack trace before recreating a new one. This is already done for remoted exceptions, and the Exception base class will prefix the original remoted stack trace to the new stack trace.
- This doesn’t scale to handle multiple exceptions. If we could solve #1, it might be attractive because it appears as-if things happened sequentially, but we can’t escape #2, no matter what we do. We could have different behavior in these two cases, but I believe it’s better to remain consistent instead. Otherwise, developers will need to write their exception handlers two ways: one way to handle singular cases, and the other way to handle multiple cases, where the same API may do either nondeterministically.
Given that we need to propagate multiple exceptions, we should wrap them in an aggregate exception object, and propagate that instead. At least this way, the original exceptions will be preserved, stack trace and all. Of course the original exceptions themselves might be other aggregates, handling arbitrary composition.
For sake of discussion, call this aggregate exception System.AggregateException, which of course derives from System.Exception. It exposes the raw Exception objects thrown by the threads, via an ‘Exception[] InnerExceptions’ property, and additional meta-data about each exception: from which thread it was thrown, and any API specific information about the concurrent operation itself. This last part is just to help debuggability. For instance, we might tell the developer that the ArgumentNullException was thrown from a thread pool thread with ID 1011, and that it occurred while invoking the 266th element ‘e’ of array ‘d’. We might also guarantee the exceptions will be stored in the order in which they were marshaled back to the forking thread, just to help the developer (as much as we can) piece together the sequence of events leading to failures.
(Editor’s note: we decided against storing this meta-data information for various reasons.)
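The marshal-and-aggregate scheme described above can be sketched as follows. This uses the System.AggregateException type that later shipped in .NET 4; the AggregatingParallel class, its partitioning heuristic, and the choice of a ConcurrentQueue as the marshaling channel are assumptions for illustration only.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class AggregatingParallel
{
    // Hypothetical sketch: every partition catches its own failure and
    // enqueues it; the forking thread throws one aggregate after the join.
    public static void ForAll<T>(Action<T> a, T[] d)
    {
        int workers = Math.Max(1, Math.Min(Environment.ProcessorCount, d.Length));
        int chunk = (d.Length + workers - 1) / workers;
        ConcurrentQueue<Exception> failures = new ConcurrentQueue<Exception>();

        Thread[] threads = new Thread[workers - 1];
        for (int p = 1; p < workers; p++)
        {
            int start = p * chunk;
            int end = Math.Min(start + chunk, d.Length);
            threads[p - 1] = new Thread(() => RunPartition(a, d, start, end, failures));
            threads[p - 1].Start();
        }

        // Failures on the forking thread are captured the same way.
        RunPartition(a, d, 0, Math.Min(chunk, d.Length), failures);

        foreach (Thread t in threads)
        {
            t.Join();
        }

        // Propagate only once the fate of every partition is known.
        if (!failures.IsEmpty)
        {
            throw new AggregateException(failures);
        }
    }

    private static void RunPartition<T>(
        Action<T> a, T[] d, int start, int end, ConcurrentQueue<Exception> failures)
    {
        try
        {
            for (int i = start; i < end; i++)
            {
                a(d[i]);
            }
        }
        catch (Exception ex)
        {
            // The original object is stored, not rethrown, so its stack
            // trace is left untouched.
            failures.Enqueue(ex);
        }
    }
}
```

Note that the queue order reflects the order in which failures were marshaled, not the order in which they were thrown, and the number of inner exceptions depends on how many partitions the machine ends up using.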
Now the dev can do whatever he or she wishes in response to the exception. Previously they might have written:
try {
    ForAll(a, d);
} catch (FileNotFoundException fex) {
    // Handler(fex);
}
And now they would have to instead write:
try {
    ForAll(a, d);
} catch (AggregateException pex) {
    List<Exception> unhandled = new List<Exception>();
    foreach (Exception e in pex.InnerExceptions) {
        FileNotFoundException fex = e as FileNotFoundException;
        if (fex == null) {
            // Not one of ours: keep the original exception for repropagation.
            unhandled.Add(e);
        } else {
            // Handler(fex);
        }
    }
    if (unhandled.Count > 0)
        throw new AggregateException(unhandled);
}
In other words, they would catch the AggregateException, enumerate over the inner exceptions, and react to any FileNotFoundExceptions as they would have normally. (Taking into consideration that there might have been multiple.) At the end, if there are any non-FileNotFoundExceptions left over, we propagate a new AggregateException with the handled FileNotFoundExceptions removed. If there was only one remaining, we could, I suppose, try to rethrow just that, but this has the same nondeterminism problems mentioned above.
Very few people will write this code. But one of the most vocal arguments against it is: just throw one singular exception, such as ForAllException, and let it crash, because no developer will handle it. Well, that scheme is no better than throwing the AggregateException. At least the aggregation model lets people write backout and recovery code if they have the patience to deal with the reality that multiple exceptions occurred.
To make this slightly easier, we could expose an API, ‘void Handle<T>(Func<T, bool> a) where T : Exception’, that effectively encapsulates the same logic as shown above, repropagating the exception at the end if not all of the exceptions were handled (i.e. some weren’t of type T):
try {
    ForAll(a, d);
} catch (AggregateException pex) {
    pex.Handle(delegate(Exception ex) {
        FileNotFoundException fex = ex as FileNotFoundException;
        if (fex != null) {
            // Handle(fex);
            return true;
        }
        return false;
    });
}
(One problem with this approach is that the ‘throw’ inside of Handle will destroy the original stack trace for ‘pex’. An alternative might be for Handle to modify the AggregateException in place, keeping the stack trace intact, returning a bool that the caller switches on and does a ‘throw’ if it returns false; this is unattractive because it’s error prone and could lead to accidentally swallowing, but in the end might help debuggability.)
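For reference, the AggregateException.Handle method that eventually shipped in .NET 4 takes a Func<Exception, bool> and throws a new AggregateException containing whatever the predicate rejected. The sketch below shows that behavior; HandleExample and HandleFileNotFound are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;

public static class HandleExample
{
    // Treats FileNotFoundException as handled and returns how many inner
    // exceptions were left over (and repropagated by Handle).
    public static int HandleFileNotFound(AggregateException pex)
    {
        try
        {
            pex.Handle(ex => ex is FileNotFoundException);
            return 0; // everything was handled, nothing was rethrown
        }
        catch (AggregateException remaining)
        {
            // A new aggregate holding only the unhandled exceptions.
            return remaining.InnerExceptions.Count;
        }
    }
}
```

Because the leftover exceptions arrive in a freshly thrown aggregate, the stack trace concern raised in the parenthetical above applies to this shape of API as well.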
If we cared about eliminating unnecessary catch/rethrows, we could use 1st pass filters instead, but this would only be available to VB and C++/CLI programmers, as C# doesn’t expose filters. For example, in pseudo-code:
try {
    ForAll(a, d);
} catch (fex.InnerExceptions.Contains()) {
    // Handle …
}
Although interesting, we’re trying to move away from our two pass model. So let’s forget about this for now.
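As a later footnote to the point about filters: C# did eventually gain them, as the ‘when’ clause introduced in C# 6. A sketch of what the pseudo-code might look like with that feature follows; FilterExample and Run are hypothetical names, and, like the pseudo-code, it treats the whole aggregate as handled if any inner exception matches, which is an assumption rather than a recommendation.

```csharp
using System;
using System.IO;
using System.Linq;

public static class FilterExample
{
    // Runs 'body' and reports whether it completed or was handled.
    public static string Run(Action body)
    {
        try
        {
            body();
            return "ok";
        }
        catch (AggregateException pex)
            when (pex.InnerExceptions.OfType<FileNotFoundException>().Any())
        {
            // The filter ran during the first pass; the stack is only
            // unwound because it returned true.
            return "handled";
        }
    }
}
```

When the filter returns false, the aggregate keeps propagating without ever having been caught and rethrown.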
This approach suffers when composing with non-aggregate exception aware code. For it to work well, everybody on the call stack needs to be looking inside the aggregate for “their” exception, handling it, and possibly repropagating. If we want existing BCL APIs to start using data parallelism internally, we would have to be careful here, not to break AppCompat because we start throwing AggregateExceptions instead of the originals.
This is probably where there’s an opportunity for better CLR and tool integration. For instance, you could imagine a world where the CLR automatically unravels the parallel failures, matching and running handlers for specific exceptions inside the aggregate as it goes, but repropagating if all exceptions weren’t handled. This is very hand-wavy and fundamentally changes the way exceptions work, so it would require a lot more thought. A catch block that swallows an exception (today) is just about guaranteed—asynchronous exceptions aside—that the IP will soon reach the next instruction after the try/catch block. This is a pretty basic invariant. With this proposal, that wouldn’t be the case, and would be bound to break large swaths of code. Sticking with the library approach (with all its imperfections) seems like the best plan of attack for now.
Waiting for the “join” to finish
There was something implicit in the design mentioned above. The ForAll API, and others like it, wouldn’t actually propagate exceptions until the fate of all threads was known.
Imagine we have the scenario described earlier (2048 elements, 8 threads), but slightly different: the 0th element causes an exception, but no other. It turns out this is probably a common case, i.e. that only a subset of the partitions will yield an exception. In this case, we would still have to wait for 7*256 = 1,792 elements to be run through ‘a’ before this exception is propagated. Imagine a slightly different case. The 0th element throws a catastrophic exception, and the application is going to terminate as soon as it propagates. ‘a’ simply can’t be run any more, and will keep reporting back this same exception. But it will take 8 of these exceptions to actually stop the application, i.e. by calling ‘a’ on the 0th, 256th, 512th, etc. elements, if we wait for all tasks to complete. If each exception corresponds to some failed attempt at forward progress, one that possibly corrupts state, then the damage is O(N) times “worse” (for some measurement) than in the sequential program, where N is the number of concurrent tasks.
Instead of waiting helplessly, we could try to aggressively shut down these concurrent workers.
At first, you might be tempted to employ CLR asynchronous thread aborts, but this is fraught with peril. Almost all .NET Framework code today is taught that thread abort == AppDomain unload, and reacts accordingly. State corruption stemming from libraries as fundamental as the BCL would be just about guaranteed. Changing this state of mind and the state of our software would be quite the undertaking.
Instead, we can have the concurrent API itself periodically check an ‘abort’ flag shared among all workers. The first thread to propagate an exception would set this flag. And whenever a worker has seen that it has been set, it voluntarily returns instead of finishing processing data:
for (int i = start_idx; i < end_idx && !aborted; i++) {
    a(d[i]);
}
This increases the responsiveness of exception propagation, but clearly isn’t foolproof. There will still be a delay for long-running callbacks. Thankfully, with PLINQ, TPL, and I hope most of our parallel libraries, the units of work will be individually fine-grained, and therefore this technique should suffice.
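Putting the shared flag together with aggregation gives something like the sketch below. CooperativeParallel, SharedState, and the partitioning heuristic are hypothetical; the flag is a plain volatile bool here, which is an assumption and not the cancellation design that was ultimately adopted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class CooperativeParallel
{
    private sealed class SharedState
    {
        public volatile bool Aborted;
        public readonly ConcurrentQueue<Exception> Failures =
            new ConcurrentQueue<Exception>();
    }

    public static void ForAll<T>(Action<T> a, T[] d)
    {
        int workers = Math.Max(1, Math.Min(Environment.ProcessorCount, d.Length));
        int chunk = (d.Length + workers - 1) / workers;
        SharedState state = new SharedState();

        Thread[] threads = new Thread[workers - 1];
        for (int p = 1; p < workers; p++)
        {
            int start = p * chunk;
            int end = Math.Min(start + chunk, d.Length);
            threads[p - 1] = new Thread(() => RunPartition(a, d, start, end, state));
            threads[p - 1].Start();
        }
        RunPartition(a, d, 0, Math.Min(chunk, d.Length), state);
        foreach (Thread t in threads)
        {
            t.Join();
        }

        if (!state.Failures.IsEmpty)
        {
            throw new AggregateException(state.Failures);
        }
    }

    private static void RunPartition<T>(
        Action<T> a, T[] d, int start, int end, SharedState state)
    {
        try
        {
            // Check the flag between elements; a callback that is already
            // running is never interrupted.
            for (int i = start; i < end && !state.Aborted; i++)
            {
                a(d[i]);
            }
        }
        catch (Exception ex)
        {
            state.Failures.Enqueue(ex);
            state.Aborted = true; // ask the other workers to stop early
        }
    }
}
```

How many elements the other workers still process after the first failure depends on timing, so callers can only rely on the aggregate containing at least one exception.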
If a concurrent worker is blocked, there’s not a whole lot we can do. Much like thread aborts, you might be tempted to use Thread.Interrupt to remove it from the wait condition. Unfortunately this will leave state corruption in its wake, because plenty of code does things like WaitHandle.WaitOne(Timeout.Infinite) without checking the return value or expecting a ThreadInterruptedException. The same argument applies to, say, user-mode APCs. Eventually you might also be tempted to use IO cancellation in Windows Vista to cancel errant, runaway network or disk IO requests. This would be great. But this also generally has the same problem as interruption, so until we find a general solution to that, we can’t do any of this.
(Editor’s note: We eventually solved this problem by coming up with a unified cancellation framework.)
One last note
This path forward seems best for now, but it leaves me wanting more.
In the end, this feels like a more fundamental problem. An API like ForAll gives the illusion of an ordinary, old sequential caller/callee relationship. But the callee doesn’t use a stack-based calling approach: instead, it distributes work among many concurrent workers, turning the linear stack into a sort of dynamically unfolding cactus stack (or tree). And SEH exceptions are fundamentally linear stack-based creatures.
In this world, it’s just a simple fact that data all over the place can become corrupt simultaneously. Many things can fail at once because many things are happening at once. It’s inescapable. Recovery is disastrously difficult, so most failures will end in crashes. STM’s promise for automatic recovery offers a glimmer of hope, but without it, I worry that papering a sequential “feel” on top of data/task parallelism is a dangerous game to play.
 Tuesday, June 16, 2009
One of my many focuses lately has been developing a memory ordering model for our project here at Microsoft. There are four main questions to answer when defining such a model:
- What are the ordering guarantees for ordinary loads and stores?
- What are the ordering guarantees for volatile loads and stores?
- What kinds of explicit fences are allowed?
- Where are fences used automatically, e.g. to preserve type safety and security?
These tend to be the differentiation points for any model. Everything else is mostly commodity. Not that there is much else, mind you, but respecting data dependence, not speculating ahead such that exceptions occur that wouldn't have occurred in a sequential execution, and so forth are all must haves, for instance. Most interesting permutations of answers for these questions have already been explored, and industry consensus is being reached, so it would be better to say I've been picking a model rather than defining one.
What's interesting is that memory model designers are often colored by their favorite architecture du jour. If somebody cares primarily about X86, they are apt to choose something very strong. If somebody cares primarily about ARM, however, they are apt to choose something very weak. There is a classic tradeoff here. Stronger means easier to program, while weaker means better performance. For some reason, many of the projects I've worked on have had an abundance of strong hardware (like X86) and a scarcity of weak hardware (like ARM and IA64). The reality sinks in: most developers on the team code to X86, and then when it comes time to getting more serious about the other platforms, code starts breaking all over the place. This is why the CLR went so strong in 2.0, even though IA64 was an important platform to support.
Let's look at some common answers to the above questions.
For #1:
- C++, Visual C++, ECMA 1.0, Java Memory Model, and Prism: no ordering guarantees.
- CLR 2.0: ordered stores, no ordering for loads.
For #2:
- C++: prevents compiler-only code motion, but explicit fences are needed for processor ordering.
- Visual C++, ECMA 1.0, and CLR 2.0: loads are acquire, stores are release ordered.
- Java Memory Model: loads and stores are fully ordered (sequentially consistent).
For #3:
- C++: implementation-specific.
- Visual C++: intrinsics and Win32 APIs.
- ECMA 1.0 and CLR 2.0: locks, and mostly Win32-style interlocked APIs.
- Java Memory Model: locks, compare-and-swap, atomics, etc.
For #4:
Managed environments like the CLR and JVM need to ensure type safety, even if ordinary loads and stores are unordered. This is nontrivial, because the boundary around type safety is blurred. Certainly we must ensure garbage v-table pointers are not seen. But is a thread allowed to read non-zeroed memory behind an object reference? And can it contain garbage (e.g. "values out of thin air")? What about writes done by mutator threads, including write barriers, while a concurrent collector is tracing objects in the heap? Are array lengths part of the set of protected fields that mustn't be read out of order? Strings, since they are commonly used for security checking? And so on.
It is mainly the deep questions around #4, and also some simple compatibility struggles (around things like double checked locking), that caused the stronger answers for #1 in the CLR 2.0.
In any case, I'm advocating a very different approach than the traditional models.
We pick completely weak ordering for ordinary loads and stores, to enable efficient execution on weaker platforms like ARM, PowerPC, IA64, etc. That part isn't new. But here's the clincher. No volatiles. There are special variables that are used to communicate between threads (call them volatile if you'd like), but using them implies no kind of special automatic fencing. Instead, whenever accessing such a variable, at the site of usage, the kind of fence desired must be used (compiler-enforced): full-fence (sequentially consistent), acquire-fence, release-fence, no-fence, or compiler-only-fence (for things like ensuring loads don't get hoisted as loop invariant). Of course, certain kinds of fences are sprinkled throughout the system to guarantee type safety in all of the aforementioned places (and more), but these are implementation details.
(This approach is rather like Herb Sutter's Prism and C++0x atomics. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm.)
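C# has no compiler-enforced equivalent of this, but the flavor of "say which fence you mean at the site of use" can be approximated with the System.Threading.Volatile class that shipped in .NET 4.5. The Mailbox class below is a hypothetical sketch: the fields are ordinary, and each access to the communication variable names the ordering it relies on.

```csharp
using System.Threading;

public sealed class Mailbox
{
    private int _payload; // ordinary data, no implied ordering
    private int _ready;   // communication variable: 0 = empty, 1 = published

    public void Publish(int value)
    {
        _payload = value;              // ordinary store
        Volatile.Write(ref _ready, 1); // release: the payload store cannot move below this
    }

    public bool TryRead(out int value)
    {
        if (Volatile.Read(ref _ready) == 1) // acquire: the payload load cannot move above this
        {
            value = _payload;
            return true;
        }
        value = 0;
        return false;
    }
}
```

A reviewer can see from the call sites alone that the writer assumes release ordering and the reader assumes acquire ordering. A site that needed sequential consistency would instead have to say so explicitly, for example with Thread.MemoryBarrier.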
Particularly after managing teams who developed a plethora of lock free code, I love this approach. I can review code and immediately understand what ordering invariants the developer assumed when writing the code. This doesn't really make writing lock free code any simpler, except that it forces you to pause and think about things a bit more carefully than you may have otherwise. But it certainly makes code easier to understand and maintain, and makes it clear to people that sprinkling volatile all over the place isn't going to save your butt: the only thing that will do that is careful thinking and engineering.
 Thursday, June 04, 2009
An interesting alternative to reader/writer locks is to combine pessimistic writing with optimistic reading. This borrows some ideas from transactional memory, although of course the ideas have existed long before. I was reminded of this trick by a colleague on my new team just a couple days ago.
The basic idea is to read a bunch of state optimistically, without taking a lock of any sort, and then prior to using it for meaningful work (which may depend on the state being consistent and correct), a validation step must take place. This validation uses version numbers which writers are responsible for maintaining. Specifically, we'll use two version counters, version1 and version2: the writer increments version1, performs the writes, and then increments version2; and the reader reads version2, performs its reads, and then verifies that version1 is equal to the version2 that it saw at the start. If this verification fails, we'll ordinarily just do a little spinning and then go back around the loop again.
Stop for a moment and ponder something very critical to this algorithm. The writer increments variables in the opposite order of the reader's reads. To see why this works, imagine we start with version1 == version2 == 0. There are two hazards to worry about. (1) A reader begins reading, and writes occur before it has finished. And (2) a reader begins reading while a write is in progress. These are simple to detect, and in fact boil down to the same thing. A reader sees version2 == 0, and the first thing a writer does is version1++. So when the reader attempts to validate, it will notice the version2 it saw != version1 any longer. If the writer has already begun by the time the reader arrives, it is possible for the reader to know it is doomed even before it has started doing any of its reads.
Here is the code in its full glory:
using System;
using System.Threading;

public class OptimisticSynchronizer
{
    private volatile int m_version1; // incremented before a write begins
    private volatile int m_version2; // incremented after a write completes

    public void BeforeWrite() {
        ++m_version1;
    }

    public void AfterWrite() {
        ++m_version2;
    }

    public ReadMark GetReadMark() {
        return new ReadMark(this, m_version2);
    }

    public struct ReadMark
    {
        private OptimisticSynchronizer m_sync;
        private int m_version;

        internal ReadMark(OptimisticSynchronizer sync, int version) {
            m_sync = sync;
            m_version = version;
        }

        // Valid only if no writer has started since the mark was taken.
        public bool IsValid {
            get { return m_sync.m_version1 == m_version; }
        }
    }

    public void DoWrite(Action writer) {
        BeforeWrite();
        try {
            writer();
        } finally {
            AfterWrite();
        }
    }

    public T DoRead<T>(Func<T> reader) {
        T value = default(T);
        SpinWait sw = new SpinWait();
        while (true) {
            ReadMark mark = GetReadMark();
            value = reader();
            if (mark.IsValid) {
                break;
            }
            // A write overlapped the read: back off and retry.
            sw.SpinOnce();
        }
        return value;
    }
}
We leave it to the caller of this class to acquire locks as appropriate to synchronize writers. Typically this will just mean wrapping a Monitor.Enter/Exit around calls to things like BeforeWrite, AfterWrite, and DoWrite. But readers explicitly do not need this same protection. DoRead exemplifies the safe reading pattern, although it can be done by hand via the ReadMark APIs.
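To show how the pieces fit together for a concrete piece of state, here is a compact, self-contained variant of the same technique that protects a pair of ints. VersionedPair is a hypothetical class written for this illustration; it inlines the two counters instead of using OptimisticSynchronizer, serializes writers with an ordinary lock, and uses Volatile.Read and Volatile.Write where the class above uses volatile fields.

```csharp
using System.Threading;

public sealed class VersionedPair
{
    private readonly object _writeLock = new object();
    private int _version1; // bumped before the writes
    private int _version2; // bumped after the writes
    private int _x;
    private int _y;

    public void Set(int x, int y)
    {
        lock (_writeLock) // writers are mutually exclusive
        {
            Volatile.Write(ref _version1, _version1 + 1);
            Volatile.Write(ref _x, x);
            Volatile.Write(ref _y, y);
            Volatile.Write(ref _version2, _version2 + 1);
        }
    }

    public void Get(out int x, out int y)
    {
        SpinWait sw = new SpinWait();
        while (true)
        {
            // Readers take no lock: read version2, read the data, then
            // validate against version1 (the opposite order to the writer).
            int seen = Volatile.Read(ref _version2);
            x = Volatile.Read(ref _x);
            y = Volatile.Read(ref _y);
            if (Volatile.Read(ref _version1) == seen)
            {
                return; // no writer overlapped this read
            }
            sw.SpinOnce();
        }
    }
}
```

If a writer always stores a pair satisfying some invariant (say, y == -x), a reader that gets past validation should never observe a torn pair, even though it never touches the lock.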
It's also worth considering what kinds of fences are truly required for this to work. Logically speaking, we need to ensure the entrance to a protected block (either read or write) is an acquire fence, and exit from the block is a release fence. This is similar to the ordering semantics necessary for a lock block. So long as we use volatile modifiers for the version counters, and for the variables read within the protected block, this will work fine. Even on weak models like IA64. The beautiful thing is that we don't need full fences, even on models like X86 that make use of store buffer forwarding. The classic store buffering case we may worry about (on a single-threaded execution) would be something like this, in pseudo-code:
version1++;
X = 42;
Y = 99;
version2++;
tmp = version2;
r0 = X;
r1 = Y;
success = (tmp == version1);
We'd be worried about satisfying some loads out of the store buffer, while satisfying others out of the memory system. But this is safe: if the load of X or Y sees a different processor's writes, then the subsequent load of version1 necessarily must witness the new value written by the other processor too. And therefore the validation will fail as we would expect and hope.
Here is a quick performance benchmark I whipped together, much in the same spirit as my previous reader/writer lock examples. I've measured varying numbers of writers (0%, 5%, 10%, 25%, 50%, and 100%), and each thread spends a certain amount of time inside the "lock region" doing some nonsense busy work. The certain amount of time is measured in terms of number of function calls (0, 10, 100, and 1000), and the work doesn't vary at all depending on whether a thread is reading or writing. I've measured four things: (1) Monitor.Enter/Exit as the baseline (where both readers and writers just acquire the mutually exclusive lock), (2) ReaderWriterLockSlim, (3) the spin-based lock that I showed previously, and (4) the new OptimisticSynchronizer class with optimistic retry. The values are the ratio compared to the baseline (1), so that >1.0x means the particular entry is slower, while <1.0x is faster. I did these measurements on an 8-way machine -- unlike the previous study which was on a 4-way machine -- which means that 0.125x would be a linear speedup compared to the serialized Monitor version:
0% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.26      1.55       1.39        0.38
SpinRWL    0.12      0.17       0.13        0.18
OptSync    0.05      0.08       0.11        0.12

5% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.36      1.70       1.40        1.07
SpinRWL    0.98      1.07       0.55        0.30
OptSync    0.35      0.43       0.31        0.24

10% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.42      1.66       1.23        1.06
SpinRWL    1.41      1.61       0.91        0.51
OptSync    0.56      0.66       0.46        0.31

25% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.36      1.97       1.24        1.03
SpinRWL    2.39      2.22       1.08        0.89
OptSync    0.84      0.99       0.86        0.59

50% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.48      1.80       1.21        1.05
SpinRWL    3.16      3.30       1.81        1.19
OptSync    0.91      0.94       1.10        0.92

100% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.35      1.67       1.22        1.09
SpinRWL    5.84      5.84       2.49        1.18
OptSync    0.93      0.99       1.13        1.17
For all cases but the 100% writers case, the OptimisticSynchronizer class does extraordinarily well.
Although this approach screams performance-wise, it is admittedly much more difficult and error-prone to use. If the variables protected are references to heap objects, you need to worry about using the read protection each time you touch a field. Just like locks, this technique doesn't compose. As with anything other than simple locking, use this technique with great care and caution; although the built-in acquire and release fences shield you from memory model reordering issues, there are some easy traps you can fall into. And as with any optimistic reading, memory safety is a necessity; trying to use these techniques in C++, for example, can easily lead to access violations and memory corruption. Tread with caution.
Update 6/4: This technique, of course, is subject to ABA problems. I failed to mention that originally. That is, if enough writes occur between a reader's read of version2 and its validation against version1 (2^32 of them for a 32-bit counter), the version1 field will wrap around such that the reader will (erroneously) successfully validate. Fixing this on 64-bit is simple (use a 64-bit counter) but is less so on 32-bit due to the lack of atomicity on loads and stores of 64-bit values (without using, say, an XCHG or related primitive).
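As a sketch of both halves of that note: a 32-bit counter returns to its starting value after 2^32 increments, and on .NET one way to get an atomic 64-bit counter even on 32-bit platforms is to go through Interlocked. The WrapDemo and WideVersion names are hypothetical, and wiring a 64-bit counter into the class above is left as an exercise; this only shows the primitives.

```csharp
using System.Threading;

public static class WrapDemo
{
    // Value of a 32-bit counter that started at 'start' after the given
    // number of increments, computed without actually looping.
    public static int After(int start, long increments)
    {
        return unchecked((int)(start + increments));
    }
}

public sealed class WideVersion
{
    private long _value;

    // Atomic on both 32-bit and 64-bit platforms.
    public long Increment()
    {
        return Interlocked.Increment(ref _value);
    }

    // A plain read of a long can tear on 32-bit; Interlocked.Read cannot.
    public long Read()
    {
        return Interlocked.Read(ref _value);
    }
}
```

WrapDemo.After(7, 1L << 32) returns 7, which is the false validation described above. The Interlocked operations imply full fences, so they cost more than the volatile accesses used in the 32-bit version.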
Update 6/18: My original write-up incorrectly made some hidden assumptions about the use of volatile. This has now been cleared up.
 Thursday, May 28, 2009
Two persons stand on a railway embankment at points A and C, exactly 500 meters apart. Lightning strikes precisely in the center of them, at point B, 250 meters away from both:
<--A----------B----------C-->
Presuming both persons are stationary, does the event (lightning strikes) occur “at the same time” from the perspective of the two persons? In our simplistic one dimensional model, the answer is Yes, precisely because the point of the lightning strike, B, is equidistant from A and C.
The person at point C may just as well be responsible for generating the event, by using some form of light rod instead of a lightning bolt supplied by nature. If the person at C lights such a rod, would the event still occur “at the same time” for both persons? Clearly No, because it will take some amount of time for the event’s occurrence to travel the distance from C to A, specifically the time it takes for light to travel 500 meters. Whereas for C it happens nearly instantaneously.
Practically speaking, this amount of time it takes for the light to travel to A will of course be so minute as to be nearly immeasurable, but nevertheless there are two separate times t and t’, the former being the actual time the rod is lit at C, and the latter being the time at which it is perceived at A. This is commonly referred to as relativity of simultaneity, introduced as Lorentz’s local time in the late 1800's and further formalized by Einstein's special theory in 1905.
Now imagine that a new person is placed at point B, equidistant from A and C, where the original lightning struck. If the person at C lights his rod, will the person at B observe the event before the person at A does? Most certainly. It takes less time for light to travel 250 meters than it does 500 meters.
Let us extend our working example a bit. Imagine again three persons, one at point A, one at point B, and another at point C. Those at B and C hold their own light rods. The person at C lights a rod, and in response to seeing C’s rod lighting, the person at B also lights a rod. The question is, must the person at A witness the light emanating from C’s rod prior to witnessing the light emanating from B’s rod? Unless the person at B’s response is truly instantaneous (which we assume is practically impossible), or unless he can see into the future (which we also assume is impossible), clearly the answer is Yes. Because the rod at B was lit in response to witnessing C’s lighting of the rod, some amount of time must have passed during the response, and the person at A will thus see C’s lighting first (or at the very least simultaneously, assuming near-impossible instantaneous response). We say B’s lighting is causally dependent on C’s lighting.
The main point here is that time is an illusion. There is no global time clock. Instead, events are not only distinguished by some monotonically increasing time value t, but also by a location which is defined by three-space coordinates. This is Minkowski’s four-dimensional space. One event occurring at coordinates { x=0, y=0, z=0, t=99 } may not appear to be simultaneous with some other event at coordinates { x=42, y=42, z=42, t=99 }, depending on the observer's location, even though both events occur at time t = 99.
Perception is relative. There is no global total ordering, only a local (relative) one.
A similar phenomenon is true of multiprocessors. In fact, nearly everything said above applies equally to them, provided that you replace “persons” with “processors”, and lighting of rods with writes to memory and witnessing of the light with reads from memory.
Multiprocessor architects must cope with the increasing bottleneck on a central memory unit, particularly due to shared memory programming. One common means of doing so is to increase the distance between processing elements and the memory units they use, padding this distance with ample levels of cache. Some processor A may have a local memory (cache) that is separate from some other processor C’s, and A’s writes to it will be visible to A before C, for example. And if some processor B sits in between them, it may notice such writes before C does. Locality matters.
Of course, memory ordering models are meant to eliminate such distances from the programmer’s mind, at least to some degree. They provide a set of rules governing the timing and ordering of events. But there is just no denying the laws of physics. My claim is that a proper ordering model ought to obey what can be derived from the special theory of relativity: no more, no less. That is, the fundamental laws of how events occur and are correlated in the real world should be mimicked. This means only two things, as far as I can tell:
- An event stream (writes) originating from a source must appear to happen in order.
- Causality is respected, in that when C causes B, it is implied that A must see C followed by B.
This turns out to be stronger than some models, but also weaker in some regards. Distance and latency are first class, embraced, and allowed. There is some cost to ensuring events leaving a locale do so in order, and that events arriving into a locale also do so in order. Given coarse enough locales, however, this cost ought to be amortized over the cost of inter-locale communication.
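One well-known way to produce an ordering that respects both rules is the logical clock from Lamport's paper (the second reference at the end of this post). The sketch below is only an illustration of that technique, not something this post prescribes; `LamportClock`, `Tick`, `Receive`, and `RodExample` are names made up for the example.

```csharp
using System;

// Hypothetical helper: one logical clock per "person" (or processor).
// Not thread-safe; each clock is meant to be owned by a single locale.
sealed class LamportClock
{
    private long m_time;

    // A local event, such as lighting a rod (a write).
    public long Tick()
    {
        return ++m_time;
    }

    // Witnessing someone else's event (a read of their write):
    // jump past the stamp that arrived with it.
    public long Receive(long stamp)
    {
        m_time = Math.Max(m_time, stamp) + 1;
        return m_time;
    }
}

static class RodExample
{
    // C lights a rod; B sees it and lights its own in response.
    // Returns the stamps of (C's lighting, B's lighting).
    public static Tuple<long, long> Run()
    {
        var c = new LamportClock();
        var b = new LamportClock();
        long cLit = c.Tick();   // C lights its rod
        b.Receive(cLit);        // B witnesses C's light
        long bLit = b.Tick();   // B lights its rod in response
        return Tuple.Create(cLit, bLit);
    }
}
```

Any observer that orders events by stamp will place C's lighting before B's, which is the causality rule. Events with no causal link may receive stamps in either order, which matches the idea that there is no global total ordering.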
Notice that sequential consistency is explicitly discouraged. The ordinary store-followed-by-load reordering that I've written about many times is legal. Considering this phenomenon in the context of light rods and relativity makes it clear why. Imagine that the persons at A and C light their rods simultaneously. If the person at A, immediately after lighting the rod, looks to the right to see if C has lit the rod, the answer will be No; and similarly, if the person at C, immediately after lighting the rod, looks to the left to see if A has lit the rod, the answer will also be No. Although the real reason has to do with gross details like store buffers and cache coherency, the elegant reason supported by the model is that it takes time for light to travel the distance between A and C.
I also want to point out that “memory ordering model” commonly refers to individual loads and stores, at a very low level, but applies just as well to a higher-level model such as might be found in an actor-oriented (message passing) programming language. People often believe memory ordering and interleaving go away magically with message passing models. This is simply not true, even if instruction-level interleaving is eliminated. The granularity merely coarsens, but the problem remains the same.
Despite the lack of sequential consistency, implementing such a model can pose challenges, due to restrictions on optimizations like pipelining and out of order execution. (Hey, at least we needn’t worry about processors moving about at different velocities, as in the more interesting parts of special relativity.) But I believe it is necessary. The price paid will be rewarded with a system that human beings can be taught to reason about as they do in the real world. Remember: I am not just talking about memory models in the traditional sense, where people are tempted to sweep the problem under the rug of "only super-developers doing lock-free programming need a model"; it matters for higher level concurrency orchestration patterns too. In the end, let us not forget: correctness and understandability trump performance optimizations for all but the most low-level systems developers, who make up less than 1% of the development population.
1. Relativity: The Special and General Theory. http://en.wikisource.org/wiki/Relativity:_The_Special_and_General_Theory
2. Time, Clocks, and the Ordering of Events in a Distributed System. http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf
 Saturday, May 16, 2009
A while back, I made a big stink about what appeared to be the presence of illegal load-load reorderings in Intel's IA32 memory model. They specifically claim this is impossible in their documentation. Well, last week I was chatting with a colleague, Sebastian Burckhardt, about this disturbing fact. And it turned out he had recently written a paper that formalizes the CLR 2.0 memory model, and in fact treats this phenomenon with a great deal of rigor:
Verifying Compiler Transformations for Concurrent Programs http://research.microsoft.com/pubs/76524/tr-2008-171-latest-03-11-09.pdf
To jog your memory, the problematic example is
X = 1; r0 = X; r1 = Y;
where both X and Y are shared memory locations, and r0 and r1 are processor registers. According to Intel's IA32 memory model, two loads to different locations cannot reorder. But it is completely possible for the load of X to be satisfied out of the store buffer, and for r1=Y to pass the store (thereby also passing the load r0=X). This is a standard Dekker reordering, but the usual example consists of just { X = 1; r1 = Y }.
The key to modeling this is to turn an adjacent store-load affecting the same location into a single instruction. Therefore, the above becomes something like:
r0 = 1; X = r0; r1 = Y;
Now it becomes entirely clear what has gone wrong. I have yet to see a clear description of this phenomenon, but Sebastian's paper does a great job.
During the discussion, Sebastian showed me another disturbing four processor example:
P0          P1          P2          P3
==          ==          ==          ==
X = 1;      r0 = X;     Y = 1;      s0 = X;
            r1 = Y;                 s1 = Y;
Is it possible, after all four processors complete, that { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }? This seems ridiculous, given a memory model where loads cannot reorder. It seems that no serializable execution should lead to this. But let's look at one problematic interleaving. First, we merge the instruction stream on P0 with P1, and also P2 with P3. This effect could occur if these writes are in functions that end up running on the same processor, or running on a machine that shares functional units (like hyperthreading), hierarchies that share a cache, and so on. We end up with:
P0/P1       P2/P3
=====       =====
X = 1;      Y = 1;
r0 = X;     s0 = X;
r1 = Y;     s1 = Y;
Now let's permute these with the new rule introduced above in mind:
P0/P1       P2/P3
=====       =====
r0 = 1;     s0 = X;
r1 = Y;     s1 = 1;
X = r0;     Y = s1;
At this point, it should be obvious what the problematic reordering would be. Let's continue merging these into a single execution order:
P0/P1/P2/P3
===========
r0 = 1;     // #1
r1 = Y;     // #0
s0 = X;     // #0
s1 = 1;     // #1
X = r0;     // #1
Y = s1;     // #1
The outcome? { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }. Whoops.
I have yet to observe this happening in practice, but models that permit store buffer forwarding are fundamentally vulnerable to this reordering. The solution here is the same as with Dekker. Marking the variables volatile is insufficient: you need to insert full memory fences between the adjacent store-load pairs.
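Here is a rough harness for the merged two-processor form of the example, for anyone who wants to poke at it. It is a sketch under my own assumptions rather than a rigorous litmus test: `StoreLoadLitmus`, `CountBothZero`, and the `useFence` switch are hypothetical names, and the fence is placed right after each store using `Thread.MemoryBarrier`.

```csharp
using System;
using System.Threading;

static class StoreLoadLitmus
{
    static int X, Y;

    // Runs both instruction streams `iterations` times and returns how often
    // each side missed the other's store (r1 == 0 and s0 == 0).
    public static int CountBothZero(int iterations, bool useFence)
    {
        int r1 = 0, s0 = 0, hits = 0;

        // The post-phase action runs while both workers are parked at the
        // barrier, so it can score the previous round and reset X and Y.
        using (var barrier = new Barrier(2, b =>
        {
            if (b.CurrentPhaseNumber > 0 && r1 == 0 && s0 == 0) hits++;
            X = 0;
            Y = 0;
        }))
        {
            var t0 = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    barrier.SignalAndWait();
                    Volatile.Write(ref X, 1);              // X = 1
                    if (useFence) Thread.MemoryBarrier();  // full fence after the store
                    int r0 = Volatile.Read(ref X);         // r0 = X
                    r1 = Volatile.Read(ref Y);             // r1 = Y
                }
                barrier.SignalAndWait();                   // lets the last round be scored
            });
            var t1 = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    barrier.SignalAndWait();
                    Volatile.Write(ref Y, 1);              // Y = 1
                    if (useFence) Thread.MemoryBarrier();  // full fence after the store
                    int s1 = Volatile.Read(ref Y);         // s1 = Y
                    s0 = Volatile.Read(ref X);             // s0 = X
                }
                barrier.SignalAndWait();
            });
            t0.Start();
            t1.Start();
            t0.Join();
            t1.Join();
        }
        return hits;
    }
}
```

With `useFence` set to true the count should always be zero, because each store is globally visible before the loads that follow it. With it set to false, a nonzero count is permitted but not guaranteed; whether you actually see one depends on the hardware and the timing.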
As we were hard at work creating PFX, we had a sister team of great talent working with us every step of the way. Their job? To do to Visual Studio 2010 what PFX did to .NET 4.0, by substantially improving the development experience for parallel programming on Windows. This includes both diagnostics & debugging, as well as profiling.
Daniel Moth, the program manager for a lot of the IDE features, just wrote up a comprehensive blog post on the new tasks window:
Parallel Tasks - new Visual Studio 2010 debugger window
The new window gives you a view into all of the tasks in your process, their statuses, and where they are running:

Because both TPL and PLINQ use tasks for execution, it supports both quite nicely. And it has (consistent!) support for both .NET and C++ tasks. The parallel stacks window is also an impressive new feature, and I'm guessing Daniel will also cover that in the coming weeks. Keep your eyes peeled. If all goes well, you'll even get to try them out first-hand, once Beta1 is available.
And if that weren't enough to entice you to visit his blog, check out this nasty machine that Daniel uses to run his kitchen appliances:

Oh, the insanity. I am thinking Task Manager will need revising in Windows 8.
 Friday, May 08, 2009
The parallel computing team just shipped an early release of Axum (fka Maestro), an actor-based programming language with message passing and strong isolation.
I'm personally very excited to see what comes of Axum. It's one step on the long road towards the vision of automatic parallelism. Although I can't claim credit for anything concrete, I was the chief designer of the fine grained isolation model Axum is built atop (something I call "Taming Side Effects" (TSE)). It's a blend of functional programming with imperative programming enabled by using the concepts of Haskell's state monad in a more familiar way. I'll try to blog a bit more about it in coming weeks. It turns out I've recently shifted my focus to a new project with the aim of applying these ideas very broadly for a whole new platform.
Doing incubation work at Microsoft is tough work, because it takes a strong vision and drive to keep pushing forward. You need to take stances that are unconventional, risky, and often just plain unpopular, and drive against all odds. Usually you aren't going to make any money off the ideas for years at a time, so it also takes a supportive management team that is willing to give you creative freedom and cut you checks. Most such efforts fail in a vacuum. But hats off to the team for pushing hard, and going out early to ask what developers think. This is a huge milestone.
 Tuesday, March 31, 2009
I often speak of the need to develop programming models whereby developers write code in the most natural way possible, and it just so happens to be inherently parallel. I don’t believe the lion’s share of developers want to rearchitect and rewrite their code with parallelism at the forefront of their development process. They don’t want to think about shipping memory over to the GPU and launching a highly-specialized data parallel kernel of computation, nor do they want to have to add locks and transactions everywhere to make things safe. I do, however, believe the lion’s share of developers wouldn’t mind if their code ran faster as hardware got faster (via more cores).
(To be clear, there are certainly a lot of developers who will be happy to write specialized code if it means eking out every last bit of performance on their machine. But that’s the minority.)
This viewpoint tends to get a lot of skeptical looks from people who quickly point out that this has been tried countless times before, and always leads to failure. They, of course, are referring back to the 80’s and early 90’s where “dusty deck” parallelization was all the rage, mostly in the realm of vectorization and HPC. To be fair, there were some mild successes in getting floating point loops parallelized, but there’s no wonder these attempts had little longevity. No touch solutions are always inadequate. Trying to make a fundamentally non-secure program secure, by way of, say, virtualization, may work in some constrained circumstances. But the right solution is to develop models and practices that lead to security-by-construction.
Furthermore, languages were (and in most cases still are) lacking some major prerequisites for large-scale automatic parallelization:
- Safety. Unless a compiler can reason about the determinism of a program when run in parallel, one cannot prove that its results will remain correct when parallelism is added. Compilers are therefore limited to parallelizing highly-specialized recognized patterns, like loops comprised solely of floating point additions of two arrays indexed by the loop iteration.
- Performance. Rampantly parallelizing a huge program wherever possible is dangerous, will drain performance, and make power consumption skyrocket. Dynamic techniques like workstealing and static techniques like nested data parallelism and profile guided feedback need to work together to inform these decisions. Machine-wide resource management needs to know about the memory topology, machine load and policy, and make informed decisions based on them. Although there has been a lot of disparate research in these areas over the years, they have only recently been coming together. Certainly in the 80’s, they were in their infancy.
- Declarative patterns. Most of the prior art was done in FORTRAN, a standard imperative language riddled with loops, effects, and assignments. Programs need to be written with as few dependencies as possible in order to expose large amounts of parallelism, and the von Neumann inspired languages fall short of this aim. Data comprehensions allow set-at-a-time computations to be expressed in a higher-order way, and newer languages like Fortress have language semantics that permit parallel evaluation in many more areas, like argument evaluation. And application models that encourage isolation and loose state coupling allow coarse partitioning.
In addition to all of those three things, we must have realistic expectations. Even if a program were fully safe to parallelize, as many Haskell kernels are, we would seldom see perfect scaling. Buying a 128-core machine doesn’t necessarily give you a 128x speedup. Why? Because there are still portions of the code that will end up less parallel than other portions, and some parts may even continue to run sequentially. There will always be I/O and waiting: these are real programs, after all, and real programs tend not to be 100% computation. They need to do something useful with the real world. Moreover, safety does not mean “dependence free.” And data dependencies are ultimately what limit parallelism.
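The usual back-of-the-envelope for this is Amdahl's law, which the paragraph above describes without naming. A minimal sketch, where `Scaling`, `AmdahlSpeedup`, and the 95% figure used afterwards are assumptions chosen purely for illustration:

```csharp
using System;

static class Scaling
{
    // Amdahl's law: the best-case speedup on `cores` processors when only
    // `parallelFraction` (between 0 and 1) of the running time can be
    // spread across them and the rest stays sequential.
    public static double AmdahlSpeedup(double parallelFraction, int cores)
    {
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / cores);
    }
}
```

If we assume 95% of a program parallelizes perfectly, `Scaling.AmdahlSpeedup(0.95, 128)` comes out at roughly 17.4, a long way short of 128x. The remaining 5% of sequential code dominates long before the cores run out.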
My stated goal would therefore be that parallelism in programs is solely limited by data dependence. Safety issues are handled by construction. Performance is (mostly) handled by the system, although as with all things, there will be some amount of measurement, hints, and tuning necessary. But hopefully a huge part of tuning performance will be seeking out needless dependencies, or finding new algorithms that have different dependence characteristics. And with that, we can focus our energy on raising the level of abstraction and pushing more declarative patterns that are broadly useful. Over time as more and more programs are written in this fashion, they become more and more naturally parallel.
What do you think? Am I crazy? Perhaps. But I still know we can do it.
 Friday, March 13, 2009
Managed code generally is not hardened against asynchronous exceptions.
“Hardened” simply means “written to preserve state invariants in the face of unexpected failure.” In other words, hardened code can tolerate an exception and continue being called subsequently without a process or machine restart. Conversely, code that is not hardened may react sporadically if continued use is attempted: by corrupting state and subsequently behaving strangely and unpredictably.
Asynchronous exceptions are a foreign concept to native programmers, and arise because there is a runtime underneath all managed code that is silently injecting code on behalf of the original program. The only truly asynchronous exception is ThreadAbortException, but any in the set { OutOfMemoryException, TypeInitializationException, ThreadInterruptedException, StackOverflowException } are often labeled as such. While thread aborts can happen at any line of code outside of delay-abort regions, these other exceptions can be introduced by the CLR at surprising times; i.e., { memory allocations, static member access, blocking calls, any function call }. The effect is that, unlike most exceptions, the points at which they may occur are not obvious. OOMs, for instance, can happen at any method call (due to failure to allocate memory in which to JIT code), implicit boxing, etc.
(As of 2.0, StackOverflowException is no longer relevant because SO triggers a FailFast of the process instead. So saying that managed code is not hardened against SO is an understatement.)
Also, because of the way COM reentrancy works, any blocking call can lead to any arbitrary code dispatched through STA pumping. And that arbitrary code, much like an APC, can fail via any arbitrary exception. These are a lot like asynchronous exceptions. So in truth, code that isn’t written to respond to arbitrary exceptions at all blocking points is technically not hardened either.
.NET doesn’t provide checked exceptions, so the blunt reality is that very little managed code is hardened properly to synchronous exceptions either. I think we do a better job in the framework of carefully engineering the code to resiliently tolerate failure, usually by being very careful about argument validation, but we aren’t perfect. Some things slip through.
If you stop to think about why hardening isn’t done, it’s probably obvious. It’s darn difficult. Especially for asynchronous exceptions where nearly every line of code must be considered. In Win32 programming, most failure points are indicated by return codes. (Although C++ exceptions can sneak through the cracks at surprising times. Like the fact that EnterCriticalSection can throw.) While error codes are cumbersome to program against (since every call needs to be checked for a plethora of conditions, making it easy to miss something), at least the response to failure is explicit. You can decide to propagate and leave state corrupt, fix up state and then propagate, rip the process, or ignore the failure, as appropriate. This becomes part of the API contract. In managed code, you need to know to wrap such calls in try/catch blocks. Nobody does this. It’s insane to even consider doing that. And because nobody does, you can’t even catch exceptions coming out of a single API call and know, when faced with an OOM (for example), that all code on the propagating callgraph has transitively handled the failure in a controlled manner. The very fact that the lock{} statement auto-unlocks without rolling back corrupt state should be indication enough of the current state of affairs.
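To make “hardened” a little more concrete, here is a small sketch of what rolling back state inside a lock might look like. `Ledger`, its invariant, and the `afterAppend` callback are all hypothetical, invented only to give the example something to fail on. Note that this only covers synchronous exceptions: an asynchronous exception could still strike in the middle of the catch block itself, which is exactly why this is so hard to get right.

```csharp
using System;
using System.Collections.Generic;

sealed class Ledger
{
    private readonly object m_lock = new object();
    private readonly List<int> m_entries = new List<int>();
    private int m_total; // Invariant: m_total equals the sum of m_entries.

    public int Total { get { lock (m_lock) return m_total; } }
    public int Count { get { lock (m_lock) return m_entries.Count; } }

    // afterAppend is a stand-in for any call that might throw halfway
    // through the update.
    public void Add(int amount, Action afterAppend = null)
    {
        lock (m_lock)
        {
            int oldCount = m_entries.Count;
            int oldTotal = m_total;
            try
            {
                m_entries.Add(amount);
                if (afterAppend != null) afterAppend();
                m_total = checked(oldTotal + amount);
            }
            catch
            {
                // lock{} will release the lock for us, but it will not undo
                // the half-finished update. We have to do that ourselves.
                m_entries.RemoveRange(oldCount, m_entries.Count - oldCount);
                m_total = oldTotal;
                throw;
            }
        }
    }
}
```

Without the catch block, a failure between the two updates would leave the list and the total out of sync, and the next caller to take the lock would see corrupt state.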
An instance of any of the aforementioned exceptions means the AppDomain is toast.
By toast, I mean that it’s soon going to be unusable, and hopefully actively being unloaded. Code in the framework assumes this, and you should too. All it does is try to get out of the way by not crashing or hanging the ensuing unload. A small fraction of code that deals with process-wide state comprised of resources not under the purview of the CLR GC needs to worry about running and avoiding leaks. This is where things like CERs, CriticalFinalizerObjects, and paired operations stuck in finally blocks come into play. They ensure cross-process state is freed, and that asynchronous exceptions cannot occur in places that would crash or hang a clean unload.
Unfortunately, it’s not always the case that the AppDomain is unloading when such an exception occurs:
- Somebody can call Thread.Abort directly, without killing the AppDomain. They can either call ResetAbort and keep it around, or let it return to the ThreadPool which catches and swallows aborts. In fact, we tell people that synchronous aborts a la Thread.CurrentThread.Abort are “always safe”, whereas we tell people asynchronous aborts are dangerous and best avoided.
- Some framework infrastructure, most notably ASP.NET, even aborts individual threads routinely without unloading the domain. They backstop the ThreadAbortExceptions, call ResetAbort on the thread and reuse it or return it to the CLR ThreadPool. That means any code running in ASP.NET is apt to be corrupted when websites are recycled and AppDomain isolation is not being used.
- Assume AppDomain B is being unloaded. If some thread has called from A to B to C, the thread will immediately suffer an abort. The result is that C will see a thread unwinding with a ThreadAbortException, back into B, and then back to A, at which point the exception turns into a deniable AppDomainUnloadedException that can be caught. But C has seen an in-flight abort and yet it is not being unloaded. The result is that C’s state may be completely corrupt. I believe this should be considered a bug in the CLR.
- We can’t differentiate between soft- and hard-OOMs today. The former are caused by requests to allocate large blocks of memory. Often a failure here isn’t indicative of a disaster. It may be due to a need to allocate 1GB of contiguous memory, and perhaps there is fragmentation. Hard OOMs are often caused by running up against the edge of the machine where no pagefile space is available, and may indicate a failure to JIT some important method, among other things. But because we don’t differentiate, any managed code can catch-and-ignore any kind of OOM, including hard ones.
- Thread interruptions are often used as a form of inter-thread communication. For example, they can be used as a poor man’s cancellation. (This is inappropriate, and cooperative techniques should always be used. But it is widespread.) But because they are used as a means of communication, they are almost always caught and handled in some controlled manner. This is one place where we screwed up by not hardening the frameworks against interrupted blocking calls and reacting intelligently. Checked exceptions would have saved us.
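For contrast with the interruption approach in the last bullet, here is a minimal sketch of the cooperative alternative using the `CancellationToken` type from .NET 4.0. `CooperativeWorker`, `Run`, and `RunBriefly` are hypothetical names, and the 10 ms pause is an arbitrary choice for the example.

```csharp
using System;
using System.Threading;

static class CooperativeWorker
{
    // Hypothetical worker loop: does a unit of "work", then idles until either
    // the pause elapses or cancellation is requested. Returns units completed.
    public static int Run(CancellationToken token)
    {
        int completed = 0;
        while (!token.IsCancellationRequested)
        {
            completed++;                    // stand-in for real work
            token.WaitHandle.WaitOne(10);   // blocks up to 10 ms, wakes early on cancel
        }
        return completed;
    }

    // Starts the worker, lets it run for a while, then asks it to stop.
    public static int RunBriefly(int milliseconds)
    {
        int result = 0;
        using (var cts = new CancellationTokenSource())
        {
            var worker = new Thread(() => result = Run(cts.Token));
            worker.Start();
            Thread.Sleep(milliseconds);
            cts.Cancel();                   // a request, not an injected exception
            worker.Join();
        }
        return result;
    }
}
```

The worker only ever stops at points it has chosen itself, so there is no blocking call that can blow up unexpectedly and nothing for the surrounding code to harden against.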
What does all of this mean? Quite simply, the .NET Framework cannot be trusted when any of the aforementioned exceptions are thrown. Ideally the process will come tumbling down shortly thereafter, but improperly written code can catch them and continue trying to limp along. In fact, as I mentioned above, some wildly popular & successful application models do (notably, ASP.NET and WinForms).
This state of affairs is admittedly unfortunate. We don’t properly separate out the truly fatal exceptions from those that we can gracefully recover from. In an ideal world, I’d love to see us do that. For example:
- At some point, we really ought to consider FailFast instead of continuing to run code under failures we know are fatal and dangerous to attempt to recover from, much like we do with SO. At least these failures should be undeniable like thread aborts are. But this is a fairly Byzantine response and is not for the faint of heart. Given that we still live in a world where WinForms wraps the top-most frame of the GUI thread in a catch-all, presents a dialog box, and allows a user to click “Ignore & Continue”, I seriously doubt we’ll get there anytime soon.
- Never expose a ThreadAbortException to code in an AppDomain unless we can guarantee the AppDomain is being unloaded. That means getting rid of the Abort API, and thus indirectly disallowing code from catching and calling ResetAbort. It also means the A calls B calls C case would not allow B to unload until the thread voluntarily unwinds out of C.
- Allow OOMs to be caught only when they are soft. That means a call to ‘new’, and it means the catch must occur inside the same stack frame as the call to ‘new’. Such exceptions can be tolerated if code is properly written, and we will tell developers to be mindful of them. Once such an OOM propagates past the calling stack frame, it will escalate to hard.
- All other OOMs are hard and fatal. This includes failure to allocate memory to JIT code and failure to allocate 20 bytes to box an int. Hard OOMs are thus undeniable.
- Get rid of ThreadInterruptedExceptions. We screwed this up from Day One, and it’s probably too late to fix this. We added cooperative cancellation in .NET 4.0 for a reason.
- TypeInitializationExceptions can probably stay, but we should allow rerunning the cctor upon subsequent accesses. Today, once a class C throws from its cctor, the class can never be constructed. So on the current plan, it only makes sense to FailFast.
I’m sure there are many other things we could do to improve things. But these 6 general themes would be a great start.
I’m just spitballing here. There are no concrete plans to do any of these 6 things as far as I know. And at the end of the day, hardening only improves the statistics of the situation, so it tends to be very difficult to argue for one change over another, particularly if taking the change would make existing programs break. But I really would like to see the base level of reliability in managed code improve with time. Especially with the exciting work going on around contract-checking in the BCL in Visual Studio 10, I hope these topics become top-of-mind for folks again in the near future.
 Monday, February 23, 2009
Pop quiz: Can this code deadlock?
SpinLock slockA = new SpinLock();
SpinLock slockB = new SpinLock();
Thread 1           Thread 2
~~~~~~~~           ~~~~~~~~
slockA.Enter();    slockB.Enter();
slockA.Exit();     slockB.Exit();
slockB.Enter();    slockA.Enter();
slockB.Exit();     slockA.Exit();
The answer, as I'm sure you suspiciously guessed, is "it depends."
I previously posted some thoughts about whether a full fence is required when exiting the lock. In that post, I focused primarily on timeliness. But what might be even more frightening is that the answer to my question above is yes, provided two things:
1. Exit doesn't end with a full fence.
2. Enter doesn't start with a full fence.
Just making Exit a store release and Enter a load acquire is insufficient. Here's why.
Imagine a super simple spin lock that satisfies our deadlock criteria:
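class SpinLock {
    private volatile int m_taken; // 0 = free, 1 = held

    public void Enter() {
        while (true) {
            // Test with an ordinary read first, and only then attempt the XCHG.
            if (m_taken == 0 &&
                Interlocked.Exchange(ref m_taken, 1) == 0)
                break;
        }
    }

    public void Exit() {
        m_taken = 0; // A volatile store (release), not a full fence.
    }
}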
Clearly Exit satisfies #1. The technique of using an ordinary read of m_taken before resorting to the XCHG call is often known as a TATAS (test-and-test-and-set) lock, and this can help alleviate contention. And it also means we will satisfy #2 above.
To see why deadlock is possible, imagine the following (fully legal) compiler transformation. The compiler first inlines everything, so for Thread 1 we have:
Thread 1
~~~~~~~~
while (true) {
    if (slockA.m_taken == 0 &&
        Interlocked.Exchange(ref slockA.m_taken, 1) == 0)
        break;
}
slockA.m_taken = 0;
while (true) {
    if (slockB.m_taken == 0 &&
        Interlocked.Exchange(ref slockB.m_taken, 1) == 0)
        break;
}
slockB.m_taken = 0;
What has to happen next is pretty subtle. It's even unlikely a compiler would do this intentionally (as far as I can tell). But it's entirely legal to morph the above code into something like this:
Thread 1
~~~~~~~~
while (true) {
    if (slockA.m_taken == 0 &&
        Interlocked.Exchange(ref slockA.m_taken, 1) == 0)
        break;
}
// Hoisted load: spin until slockB looks free, while still holding slockA.
while (slockB.m_taken != 0) ;
slockA.m_taken = 0;
// Fix-up: attempt the XCHG now, and fall back to the original loop if it fails.
if (Interlocked.Exchange(ref slockB.m_taken, 1) != 0)
    while (slockB.m_taken != 0 ||
           Interlocked.Exchange(ref slockB.m_taken, 1) != 0) ;
slockB.m_taken = 0;
The load(s) of slockB.m_taken have moved before the store to slockA.m_taken; this is legal, even if they are both marked volatile. A load acquire can move above a store release, and the code remains functionally equivalent. Now, the code required to fix up this code motion is pretty hokey. We clearly can't do the XCHG before the store to slockA.m_taken, so we need to try it afterwards. But that brings about an awkward transformation: if it fails, we must effectively do what the original code did, spinning until we acquire the slockB lock.
Do you see the deadlock yet?
Imagine the compiler did similar code motion on Thread 2:
Thread 2
~~~~~~~~
while (true) {
    if (slockB.m_taken == 0 &&
        Interlocked.Exchange(ref slockB.m_taken, 1) == 0)
        break;
}
// Hoisted load: spin until slockA looks free, while still holding slockB.
while (slockA.m_taken != 0) ;
slockB.m_taken = 0;
// Fix-up: attempt the XCHG now, and fall back to the original loop if it fails.
if (Interlocked.Exchange(ref slockA.m_taken, 1) != 0)
    while (slockA.m_taken != 0 ||
           Interlocked.Exchange(ref slockA.m_taken, 1) != 0) ;
slockA.m_taken = 0;
Oh no! See it now?
If Thread 1 and Thread 2 both enter the critical regions for slockA and slockB at the same time, each will end up spin-waiting for the other to leave before exiting its own lock.
Boom: deadlock.
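One way to rule this out is to violate condition #1, so that Exit does end with a full fence. Here is a minimal sketch of that idea; `FencedSpinLock` is a hypothetical name, and using an interlocked exchange for the release is just one option (a store followed by `Thread.MemoryBarrier` should work too).

```csharp
using System.Threading;

sealed class FencedSpinLock
{
    private int m_taken; // 0 = free, 1 = held

    public void Enter()
    {
        var spinner = new SpinWait();
        while (true)
        {
            // Same test-and-test-and-set shape as before.
            if (Volatile.Read(ref m_taken) == 0 &&
                Interlocked.Exchange(ref m_taken, 1) == 0)
                return;
            spinner.SpinOnce();
        }
    }

    public void Exit()
    {
        // An interlocked operation is a full fence, so loads that come later
        // in program order cannot move above this release.
        Interlocked.Exchange(ref m_taken, 0);
    }
}
```

The price is an interlocked instruction on every Exit, which is noticeably more expensive than a plain store. That trade-off is the reason the question of fencing on exit comes up in the first place.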
 Sunday, February 22, 2009
A few weeks back I recorded a discussion with the infamous Erik Meijer and Charles from Channel9.
Perspectives on Concurrent Programming and Parallelism http://channel9.msdn.com/shows/Going+Deep/Joe-Duffy-Perspectives-on-Concurrent-Programming-and-Parallelism/
In it, I show my cards a bit more than intuition says I should. I'm not good at poker.
To summarize:
- Mostly functional (purity + immutability) is a great default.
- Safe, deterministic mutability (a la runST) is a must-have for cognitive familiarity.
- Isolation is key to achieve the former; type systems can help (a lot).
- Actors, agents, forkIO, <what have you> is a good model, but not the only one. Isolation is (far) more general.
- Transactions can help around the edges.
I'm working on a few papers for public consumption this year where I espouse these ideas. Keep watching for more detail.
 Friday, February 20, 2009
I was very harsh in my previous post about reader/writer locks.
The results are clearly very hardware-specific. And one can certainly argue that better implementations are possible. (In fact, I will show one momentarily.) But no matter which way you slice-and-dice it, a lock implies mutable shared state which implies contention. Herb argued this point quite well, and rather thoroughly, in his recent Dr. Dobb’s article. Interference due to contention means more time spent resolving memory conflicts and less time doing useful work. A reader/writer lock can be infinitely clever, but there is still a consensus protocol that must be established: and that implies a loss of scalability. Pretty simple.
It’s very tricky to develop a consensus protocol that is sufficiently lossy so as to relieve memory contention while at the same time being sufficiently precise that the lock works right. In the case of a spinning reader/writer lock (which is, for what it’s worth, overly naïve an approach for most circumstances), you need to ensure that a writer knows for sure when there are 0 readers, and that each reader knows for sure whether there is 0 or 1 writer. (For blocking reader/writer locks, there’s a whole lot more.) One promising thing to note is that the writer only needs to know whether there are 0 or N readers, but not the specific value N; there’s a fair bit of research on scalable counters (like this) which exploit problems of this nature. Unfortunately, it’s not completely relevant here. You need to know exactly when the transition from N to 0 readers happens in order to let the writer through in a timely fashion; and in order to account for that transition, a consensus among readers is needed. That's hard to do.
More scalable solutions are possible than the simple lock I showed previously. Although writers need to know whether readers are present, the readers themselves couldn't care less about other readers. As a result, we can make the lock slightly more expensive for the writer, because it needs to accumulate the count of readers, but this allows us to make it slightly cheaper for the readers to enter and exit, where cheaper means less contention.
Here’s one possible algorithm. We’ll keep an array of read flags and a single write flag:
private volatile int m_writer;
private ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];
A few things are noteworthy about the read flags.
First, it’s an array of ReadEntry values. These are just simple structs that wrap a volatile int, but we also pad the struct so that it’s 128 bytes in total size. That avoids the situation where multiple read flags just happen to end up sharing the same cache line (which are usually either 64 or 128 bytes in size), which leads to false sharing in the memory system (destroying our aim to reduce contention).
[StructLayout(LayoutKind.Sequential, Size = 128)]
struct ReadEntry {
    internal volatile int m_taken;
}
Second, we size the array to be 16-times the number of processors. We hash into it based on the calling thread’s unique identifier, so to reduce (but not eliminate) the chance of hashing collisions, we’ll use a few times more buckets than the total number of concurrent threads. Hashing collisions are expensive: they incur some amount of memory contention, and also demand that we use an atomic CAS increment instead of an ordinary ++. (While a super-duper-cheap TLS solution might seem more ideal, there isn’t any good per-object TLS solution to use. The array hashing approach is actually quite fast.)
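If you want to sanity-check the padding and the hashing in isolation, something like the following sketch will do. It is a variation on the fields above rather than the lock's actual code: `PaddedFlag` and `ReaderFlags` are hypothetical names, and the flag is a plain int accessed through Interlocked instead of a volatile field.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Variation on ReadEntry: a plain int, padded to a declared size of 128 bytes.
[StructLayout(LayoutKind.Sequential, Size = 128)]
struct PaddedFlag
{
    internal int m_taken;
}

sealed class ReaderFlags
{
    private readonly PaddedFlag[] m_flags =
        new PaddedFlag[Environment.ProcessorCount * 16];

    public int Length { get { return m_flags.Length; } }

    // Same hashing scheme as described above: thread id modulo bucket count.
    public int IndexForCurrentThread()
    {
        return Thread.CurrentThread.ManagedThreadId % m_flags.Length;
    }

    // Atomic increment and decrement, since two threads may hash to the
    // same bucket.
    public int Enter(int index)
    {
        return Interlocked.Increment(ref m_flags[index].m_taken);
    }

    public int Exit(int index)
    {
        return Interlocked.Decrement(ref m_flags[index].m_taken);
    }

    // The size the layout attribute declares for each entry, in bytes.
    public static int DeclaredEntrySize()
    {
        return Marshal.SizeOf(typeof(PaddedFlag));
    }
}
```

`ReaderFlags.DeclaredEntrySize()` reports the size requested by the attribute. Whether consecutive array elements actually land on separate cache lines also depends on where the runtime places the array, so treat the padding as a strong hint and measure on the hardware you care about.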
Notice that we’re using an awful lot of space for a single lock. This means the techniques I show here wouldn’t be readily applicable to a system that uses lots of fine-grained locks, like transactional memory. But similar ideas can be extrapolated, e.g., by using shared lock tables.
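To make the space cost concrete, here is a small sketch of the layout arithmetic. PaddedEntry and LockLayout are names invented for this illustration; PaddedEntry mirrors ReadEntry but drops the volatile modifier to keep the demo minimal, and Marshal.SizeOf reports the marshaled size, which for a simple struct like this should match the padded size requested in the attribute.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for ReadEntry: one int padded out to 128 bytes.
[StructLayout(LayoutKind.Sequential, Size = 128)]
public struct PaddedEntry {
    public int Taken;
}

public static class LockLayout {
    // Buckets per lock: 16 times the processor count, as in the text.
    public static int BucketCount(int processors) {
        return processors * 16;
    }

    // Bytes consumed by the read-flag array alone (array header excluded).
    public static int FootprintBytes(int processors) {
        return BucketCount(processors) * Marshal.SizeOf(typeof(PaddedEntry));
    }

    // Same hashing scheme as ReadLockIndex: thread id modulo bucket count.
    public static int BucketIndex(int managedThreadId, int buckets) {
        return managedThreadId % buckets;
    }
}
```

On a 4-processor machine this works out to 64 buckets and 8,192 bytes of read flags for a single lock. Thread ids 3 and 67 both land in bucket 3, which is the kind of collision that makes the interlocked increment necessary.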
Lastly, some invariants among these fields are self-evident. When the writer flag is 0, no writers are waiting; when it is 1, either a writer is actively in the critical section, or there is a writer waiting for readers to exit. When at least one reader flag entry is non-0, there is a reader either inside the lock or attempting to enter it. Thus, no new writer is permitted while there’s a non-0 reader entry, and no new reader is permitted while there’s a non-0 writer flag. This is sufficient to ensure the reader/writer lock properties hold.
Now let’s look at how the EnterReadLock and ExitReadLock methods work.
When a reader arrives, it spins until the writer flag is 0. It then hashes into the read flag array using its unique thread identifier, and then atomically increments the read counter. It then needs to recheck that a writer didn’t arrive in the meantime. (The CAS increment means we can safely do this without worry for reordering bugs, like the read of the writer flag passing the write to the reader flag.) If a writer hasn’t arrived, the read lock has been successfully acquired and we’re done; if a writer has arrived, however, the reader needs to back out the change (since the writer might be waiting for the read flag to become 0) and then go back to spinning. It will retry once the writer exits.
private int ReadLockIndex {
get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }
}
public void EnterReadLock() {
SPW sw = new SPW();
int tid = ReadLockIndex;
// Wait until there are no writers.
while (true) {
while (m_writer == 1) sw.SpinOnce();
// Try to take the read lock.
Interlocked.Increment(ref m_readers[tid].m_taken);
if (m_writer == 0) {
// Success, no writer, proceed.
break;
}
// Back off, to let the writer go through.
Interlocked.Decrement(ref m_readers[tid].m_taken);
}
}
(Note that SPW is a little type to encapsulate the spin-wait logic, including some amount of backoff to reduce contention. An example implementation is at the bottom of this essay, along with the full reader/writer lock code. .NET 4.0 includes a SpinWait type that provides this same functionality.)
Exiting the read lock is pretty simple. We just need to decrement our counter.
public void ExitReadLock() {
// Just note that the current reader has left the lock.
Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);
}
The writer lock is pretty straightforward. It works the same way most spin-based mutually exclusive locks work, using an atomic exchange on the writer flag, but it has an extra step after successfully acquiring the lock: a writer must walk the list of read flags, and wait for each one to become 0. (This is similar to Peterson's mutual exclusion algorithm for N-threads.) Because the write flag is set first (using an interlocked operation), and because new readers won’t enter if the flag is set, we can be assured this works correctly without hokey memory reordering problems cropping up.
public void EnterWriteLock() {
SPW sw = new SPW();
while (true) {
if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {
// We now hold the write lock, and prevent new readers.
// But we must ensure no readers exist before proceeding.
for (int i = 0; i < m_readers.Length; i++)
while (m_readers[i].m_taken != 0) sw.SpinOnce();
break;
}
// We failed to take the write lock; wait a bit and retry.
sw.SpinOnce();
}
}
And exiting the write lock is even simpler than exiting the read lock. We just set the writer flag to 0.
public void ExitWriteLock() {
// No need for a CAS.
m_writer = 0;
}
Given all of that, you might wonder how well this bad boy performs. Well, single-threaded performance is a bit worse than the previous spin reader/writer lock: about 1.55x the cost of a monitor acquisition for the read lock instead of 0.95x, and about 5.52x for the write lock instead of 0.85X. This makes sense. There’s simply a whole lot more work going on in this new lock compared to the old, simple one.
But scalability is vastly improved. Our hard work has apparently paid off. Here’s a table much like the one in the previous post: scaling over the equivalent mutually exclusive monitor code, for various percentages of writers and various amounts of "work" (counts of function calls) inside the lock region. (I have left out the legacy .NET ReaderWriterLock type because it is embarrassingly terrible.) Remember: 1.0x means the new lock performs the same as monitor, 0.5x means twice as fast, and 2.0x means twice as slow. 0.25x is the ideal speedup (4x) since I am running the tests on a four-way machine.
0% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.11x     2.01x      0.96x       0.32x
SpinRWL(old)    9.63x     7.04x      1.02x       0.26x
SpinRWL(new)    0.39x     0.36x      0.28x       0.25x
5% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.29x     2.36x      1.18x       0.61x
SpinRWL(old)    5.69x     5.59x      1.43x       0.94x
SpinRWL(new)    1.01x     0.96x      0.45x       0.38x
10% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.26x     2.04x      1.15x       1.00x
SpinRWL(old)    6.87x     5.03x      1.42x       1.34x
SpinRWL(new)    1.60x     1.51x      0.63x       0.53x
25% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.09x     2.10x      1.14x       1.00x
SpinRWL(old)    4.70x     4.20x      1.43x       1.69x
SpinRWL(new)    2.81x     2.29x      1.27x       0.73x
50% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.18x     1.95x      1.15x       0.95x
SpinRWL(old)    3.23x     3.73x      1.54x       1.39x
SpinRWL(new)    3.16x     2.76x      1.73x       1.10x
100% writers:
                0 calls   10 calls   100 calls   1000 calls
RWLSlim (3.5)   2.18x     1.95x      1.04x       0.92x
SpinRWL(old)    2.63x     2.04x      1.06x       0.87x
SpinRWL(new)    6.79x     3.96x      1.62x       1.06x
You can see there are now several more cases where the new reader/writer lock beats out both the .NET 3.5 ReaderWriterLockSlim type and our previous attempt. In fact, we now have a few new scenarios that scale, like 5% or 10% writers where the amount of work being done is at least 100 function calls. (Unfortunately, doing 100 or more function calls inside a lock that uses spin-waiting is dangerous and considered a very bad practice: you should be able to count the number of instructions on your fingers (and toes). But that’s somewhat beside the point.) In summary, so long as there is a fair amount of work going on and the percentage of writers remains very low, we might see a benefit.
So was I overly harsh on reader/writer locks in my last post? Sure, maybe a little. While I am still very disappointed in the current .NET reader/writer locks (and, I imagine, the Vista SRWLock), the results I was able to get here are a bit more promising.
But the point I was trying to get across is the same: sharing is sharing is sharing. Avoid it like the plague.
(Thanks to Tim Harris for sending me private email about my previous posts. The brief discussion inspired me to pick this back up.)
Here’s the full code for the reader/writer lock.
using System;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Runtime.InteropServices;
// We use plenty of interlocked operations on volatile fields below. Safe.
#pragma warning disable 0420
/// <summary>
/// A spin-based reader/writer lock that spreads readers across an array of
/// padded read flags to reduce contention. It only spins (no events are necessary).
/// </summary>
public class ReaderWriterSpinLockPerProc {
private volatile int m_writer;
private volatile ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];
[StructLayout(LayoutKind.Sequential, Size = 128)]
struct ReadEntry {
internal volatile int m_taken;
}
private int ReadLockIndex {
get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }
}
public void EnterReadLock() {
SPW sw = new SPW();
int tid = ReadLockIndex;
// Wait until there are no writers.
while (true) {
while (m_writer == 1) sw.SpinOnce();
// Try to take the read lock.
Interlocked.Increment(ref m_readers[tid].m_taken);
if (m_writer == 0) {
// Success, no writer, proceed.
break;
}
// Back off, to let the writer go through.
Interlocked.Decrement(ref m_readers[tid].m_taken);
}
}
public void EnterWriteLock() {
SPW sw = new SPW();
while (true) {
if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {
// We now hold the write lock, and prevent new readers.
// But we must ensure no readers exist before proceeding.
for (int i = 0; i < m_readers.Length; i++)
while (m_readers[i].m_taken != 0) sw.SpinOnce();
break;
}
// We failed to take the write lock; wait a bit and retry.
sw.SpinOnce();
}
}
public void ExitReadLock() {
// Just note that the current reader has left the lock.
Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);
}
public void ExitWriteLock() {
// No need for a CAS.
m_writer = 0;
}
}
struct SPW {
private int m_count;
internal void SpinOnce() {
if (m_count++ > 32) {
Thread.Sleep(0);
} else if (m_count > 12) {
Thread.Yield();
} else {
Thread.SpinWait(2 << m_count);
}
}
}
 Wednesday, February 11, 2009
A couple weeks ago, I illustrated a very simple reader/writer lock that was comprised of a single word and used spinning instead of blocking under contention. The reason you might use a lock with a read (aka shared) mode is fairly well known: by allowing multiple readers to enter the lock simultaneously, concurrency is improved and therefore so is scalability. Or so the textbook theory goes.
As a purely theoretical illustration, imagine we’re on a heavily loaded 8-CPU server where a new request arrives every 0.25ms and runs for 1ms. In an ideal world, we could service requests coming in at a rate of 1ms / 8-CPUs = 0.125ms without falling behind. But imagine these requests need to access some shared state, and so there is a bit of serialization required. In fact, let’s imagine each does 0.5ms’ worth of its work inside a lock. If you were to use a mutually exclusive lock, then you’d have an immediate lock convoy on your hands. Even with 8-CPUs you won’t be able to keep up. You’ll start off gradually building up a debt, and eventually come to a crawl. Let’s examine the initial timeline:
Req#   Arrival   Acquire   Release   Wait Time
1      0.0ms     0.0ms     0.5ms     0.0ms
2      0.25ms    0.5ms     1.0ms     0.25ms
3      0.5ms     1.0ms     1.5ms     0.5ms
4      0.75ms    1.5ms     2.0ms     0.75ms
5      1.0ms     2.0ms     2.5ms     1.0ms
6      1.25ms    2.5ms     3.0ms     1.25ms
7      1.5ms     3.0ms     3.5ms     1.5ms
8      1.75ms    3.5ms     4.0ms     1.75ms
Oh jeez, after only the first 8 requests, we’ve fallen way behind.
Each new request adds 0.25ms onto the amount of time the request must wait for the lock. And it’s not going to get any better:
9      2.0ms     4.0ms     4.5ms     2.0ms
10     2.25ms    4.5ms     5.0ms     2.25ms
11     2.5ms     5.0ms     5.5ms     2.5ms
12     2.75ms    5.5ms     6.0ms     2.75ms
... and so on ...
By request #9, requests have to wait for twice as long as they run. Eventually something has to give, or the server will come tumbling down.
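The timeline can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not a real server model: ConvoyModel, Row, and Simulate are names made up for this illustration, times are kept in whole microseconds so the math stays exact, and the model only tracks the lock itself (it ignores the unlocked half of each request and the number of CPUs, since the lock is the bottleneck here).

```csharp
using System;
using System.Collections.Generic;

public static class ConvoyModel {
    // One row of the timeline; all times are in microseconds.
    public struct Row {
        public int Request;
        public long Arrival, Acquire, Release, Wait;
    }

    // Models a single mutually exclusive lock granted in arrival order.
    public static List<Row> Simulate(int requests, long arrivalGapUs, long holdUs) {
        var rows = new List<Row>();
        long lockFreeAt = 0;
        for (int i = 0; i < requests; i++) {
            long arrival = i * arrivalGapUs;
            long acquire = Math.Max(arrival, lockFreeAt); // wait if the lock is still held
            long release = acquire + holdUs;
            rows.Add(new Row {
                Request = i + 1, Arrival = arrival, Acquire = acquire,
                Release = release, Wait = acquire - arrival
            });
            lockFreeAt = release;
        }
        return rows;
    }
}
```

Calling ConvoyModel.Simulate(12, 250, 500) yields the rows shown above: request 9 arrives at 2,000 microseconds, acquires at 4,000, and so waits 2,000. In this model the wait grows by the hold time minus the arrival gap, 250 microseconds, with every new request.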
Now, imagine we used a reader/writer lock instead. Threads would never wait for each other, and we wouldn’t end up with this never-ending buildup of wait times. In other words, the “Wait Time” column above would always be 0.0ms. And because the arrival rate is less than our theoretical limit of one request per 0.125ms, our lock convoy is gone. Right?
Unfortunately, probably not; this mental model is overly naïve.
Even when a read lock is acquired, there is mutual exclusion going on:
- Some reader/writer locks actually use mutually exclusive locks to protect their own internal state, like the list of current readers! This can come as a surprise, but it’s true of the .NET reader/writer locks. Even Vance’s example does this and, although it uses a spin lock in an attempt to reduce the overhead, there’s still no denying that it’s mutual exclusion.
- And even if they don’t use mutually exclusive locks, like the simple spin-based one from my previous blog post, there are CAS instructions. And a CAS instruction actually amounts to a form of mutual exclusion at the hardware level, because the cache coherency machinery needs to ensure that no two processors try to acquire and modify the same cache line exclusively.
- In addition to all of that overhead, the cost in CPU cycles of acquiring the read-lock is nowhere near zero. Because of the use of locks and/or CAS internally, and the resulting cache contention and line evictions that this will cause, throughput will suffer. If there is contention, threads may end up blocked (if real locks are being used), spinning (if spin locks are being used), or simply optimistically retrying CAS’s due to line ping ponging.
The result?
Read locks are just as bad as mutually exclusive locks when lock hold times are short. In fact, they can be worse, because reader/writer locks are more complicated and therefore cost more than simple mutually exclusive locks: many need to keep track of read lists in order to disallow recursive acquires, maintain multiple event handles so certain kinds of waiters can be awakened over others, and store various kinds of counters and flags. Even my super simple single-word, spin-based reader/writer lock needed to worry about blocking out readers when a writer was waiting, properly incrementing and decrementing the reader count when readers are racing with one another (leading to more complicated CAS on the exit path than ordinary write locks), and so on.
That said, a reader/writer lock would in fact probably work in the situation above. A hold time of 0.5ms is huge, and with only 8 concurrent threads and the arrival rate we’re talking about, the overheads are apt to be quite small in relation to the work being done. Another similar setting in which reader/writer locks commonly make a noticeable difference is in the execution of large database transactions.
But the sad truth is that we tell programmers to keep lock hold times short, and most locks I see are comprised of two dozen instructions or less. So we’re in the microsecond range at the very most, which is certainly not large enough for read locks to pan out.
To illustrate this point, I wrote a little benchmark program that measures the legacy .NET ReaderWriterLock, the 3.5 ReaderWriterLockSlim type, and my little spin reader/writer lock. All it does is spawn 4 threads on my dual-socket, dual-core (4-CPU) machine, and then loop a fixed number of times acquiring and releasing a certain kind of lock. I’ve written the test so that the amount of work done inside the lock is parameterized as a certain number of non-inlined function calls. I also parameterize the percentage of acquires that will be write-locks. Then I’ve run this a bunch of times, and compared the total time taken with the equivalent code using a CLR Monitor for mutual exclusion instead.
Here are some results, where each column represents the number of function calls. The entries are the cost relative to Monitor: 1.00x means they are the same, 0.5x means the alternative lock is twice as fast, and 2.0x means the alternative lock is twice as slow. Remember, the ideal situation would be 0.25x: that is, by allowing four threads to run completely concurrently, we run four times faster.
0% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    9.23x     6.46x      0.90x       0.49x
RWLSlim (3.5)   2.11x     2.01x      0.96x       0.32x
SpinRWL         9.63x     7.04x      1.02x       0.26x
5% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    10.55x    8.23x      1.71x       0.63x
RWLSlim (3.5)   2.29x     2.36x      1.18x       0.61x
SpinRWL         5.69x     5.59x      1.43x       0.94x
10% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    20.31x    10.39x     2.34x       0.99x
RWLSlim (3.5)   2.26x     2.04x      1.15x       1.00x
SpinRWL         6.87x     5.03x      1.42x       1.34x
25% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    74.49x    49.59x     9.18x       2.15x
RWLSlim (3.5)   2.09x     2.10x      1.14x       1.00x
SpinRWL         4.70x     4.20x      1.43x       1.69x
50% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    148.34x   98.46x     20.46x      3.63x
RWLSlim (3.5)   2.18x     1.95x      1.15x       0.95x
SpinRWL         3.23x     3.73x      1.54x       1.39x
100% writers:
                0 calls   10 calls   100 calls   1000 calls
RWL (legacy)    170.59x   123.66x    24.04x      4.29x
RWLSlim (3.5)   2.18x     1.95x      1.04x       0.92x
SpinRWL         2.63x     2.04x      1.06x       0.87x
Clearly there are a number of anomalies in these numbers. Why the legacy ReaderWriterLock balloons to 170X the cost of Monitor when we have 100% writers is a very interesting question indeed. Why my simple spin reader/writer lock is 9.63X when we have pure reads and 0 calls, and yet the ReaderWriterLockSlim type is only 2.11X is also interesting. And so on. The numbers are very specific to the version of .NET I am using, and indeed the precise machine configuration, including the number and layout of cores and caches.
But if we look more generally at the numbers, ignoring some of the surprising ones, we can make one interesting and safe conclusion: You need a really low percentage of writers, and a really long amount of time inside the lock, for any scalability wins to show up as a result of using a reader/writer lock. Our best case was the spin reader/writer lock when we had 0% writers and 1000 calls. But clearly if you have no writers, i.e., state is immutable, there’s little point in using any locks whatsoever! This is an extreme result, where threads are hammering on the lock constantly in a tight loop, but if you stop to think about it: When else would a reader/writer lock make a difference? If threads are just getting in and out of the lock very quickly, and arrivals are infrequent, then there is no benefit to allowing multiple threads in at once anyway.
The moral of the story? Besides suggesting that you seriously question whether a reader/writer lock is actually going to buy you anything, it's the same as the conclusion in my previous post on the matter:
Sharing is evil, fundamentally limits scalability, and is best avoided.
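As a rough illustration of what avoiding sharing can look like in code, here is a sketch that counts iterations two ways. It is not the benchmark used for the tables above; SharingDemo and its methods are hypothetical names, and it assumes the Parallel.For API from .NET 4.0. Both methods return the same total, but the first performs an interlocked update on one shared location per iteration, while the second accumulates privately and touches shared memory once per worker.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SharingDemo {
    // Every iteration performs an interlocked update on one shared location.
    public static long CountShared(int iterations) {
        long total = 0;
        Parallel.For(0, iterations, i => { Interlocked.Increment(ref total); });
        return total;
    }

    // Each worker accumulates privately and publishes once when it finishes.
    public static long CountPrivate(int iterations) {
        long total = 0;
        Parallel.For(0, iterations,
            () => 0L,                                       // per-worker subtotal
            (i, loopState, local) => local + 1,             // no shared writes here
            local => { Interlocked.Add(ref total, local); } // one shared write per worker
        );
        return total;
    }
}
```

How large the difference is in practice depends on the machine and on how much real work each iteration does, so measure before drawing conclusions.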
 Monday, February 02, 2009
I frequently get asked about the C# compiler's warning CS0420 about taking byrefs to volatile fields. For example, given a program
class P { static volatile int x;
static void Main() { f(ref x); }
static void f(ref int y) { while (y == 0) ; } }
the C# compiler will complain
xx.cs(8,15): warning CS0420: 'P.x': a reference to a volatile field will not be treated as volatile
because of the line containing 'ref x'. (The same applies to 'out' parameters too.) The natural question is, of course, whether to worry about it.
In general, the answer is yes, you must worry. In the above example, the use of the 'y' parameter inside 'f' will not be treated as volatile, as the warning says. What does that mean in practice? For one, the read of 'y' in 'f's while loop could be considered loop invariant by the JIT compiler and hoisted, and you'd possibly loop forever. It also means that on IA64 platforms, such reads will be emitted as ordinary loads instead of the special load-acquire variant that is emitted for volatile loads. This can lead to reordering bugs. In other words, you lose the volatile-ness of the field as soon as you cast it away as an ordinary byref. And unlike C++ where you can have a volatile pointer, there's no way to mark a .NET byref as volatile.
(You can use the Thread.VolatileRead and VolatileWrite methods to use a byref in a volatile manner. Unfortunately they are far more costly than ordinary volatile loads and stores.)
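For example, the spin loop from 'f' could be written along these lines. This is a minimal sketch: ByrefSpin and WaitForNonZero are names invented for illustration, and later .NET versions also offer Volatile.Read, which could be used the same way.

```csharp
using System.Threading;

public static class ByrefSpin {
    // Spins until the int behind the byref becomes non-zero. Each iteration goes
    // through Thread.VolatileRead, so the load cannot be hoisted out of the loop.
    public static int WaitForNonZero(ref int flag) {
        int seen;
        while ((seen = Thread.VolatileRead(ref flag)) == 0) {
            Thread.SpinWait(20);
        }
        return seen;
    }
}
```

Passing 'ref x' for a volatile field 'x' still produces CS0420 at the call site, even though every read inside the method is now volatile.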
There is one particularly annoying case in which this warning is complete noise: when passing a byref to an API that internally performs volatile (or stronger) loads and stores. I.e., the Interlocked.*, Thread.VolatileRead, and VolatileWrite methods. Because these APIs internally use explicit memory barriers and atomic hardware instructions, the byref will effectively be treated as volatile regardless of whether it was taken from a volatile field or not. And therefore it is safe.
For instance, the compiler will warn you about the following code
static volatile int x;
static void f() { Interlocked.Exchange(ref x, 1); }
even though there is no problem. You can suppress the warning with a "#pragma warning disable" just before the call
static volatile int x;
static void f() {
#pragma warning disable 0420
    Interlocked.Exchange(ref x, 1);
#pragma warning restore 0420
}
and then restore it immediately afterwards. (It's a good idea to restore the warning so that other possibly-problematic instances aren't missed.)
This comes up a whole lot. Why? Because many times you'll mark a field volatile, even though it is updated exclusively with CAS operations, because it's also used in other contexts: e.g., sequences where loads mustn't reorder or erroneously be considered loop invariant. I personally have a habit of always marking these variables as such, mostly as a carryover from Win32 whose InterlockedXX family of APIs demand volatile pointers (i.e., LONG volatile *).
I'm told that this annoying case might be fixed in the next C# compiler, by the way. Until then, I figured I'd throw this up for reference purposes.
 Thursday, January 29, 2009
Reader/writer locks are commonly used when significantly more time is spent reading shared state than writing it (as is often the case), with the aim of improving scalability. The theoretical scalability wins come because the lock can be acquired in a special read-mode, which permits multiple readers to enter at once. A write-mode is also available which offers typical mutual exclusion with respect to all readers and writers. The idea is simple: if many readers can read simultaneously, the theory goes, concurrency improves.
(I’ll be posting an analysis of reader/writer lock scalability in an upcoming post. For a variety of reasons--most related to my recent CAS post--they seldom make a dramatic impact in practice.)
In addition to showing up in libraries--such as Vista’s new SRWLock, .NET’s ReaderWriterLock, and .NET 3.5’s ReaderWriterLockSlim--they are used pervasively in relational databases, distributed transactions, and software transactional memory.
Vance Morrison demonstrated a lightweight reader/writer lock on his blog a couple years back. Although quite small, you can get smaller. Much like the new SpinLock type being made available in .NET 4.0, we can build a ReaderWriterSpinLock that offers several advantages:
1. It’s a struct, and so there is no object allocation or space for an object header necessary.
2. It’s a single word in size (i.e., 4 bytes).
3. No kernel events are ever allocated; we will spin instead.
For cases in which reads are extraordinarily frequent, and writes are extraordinarily rare, this approach can actually be useful. Unfortunately, because one common case in which reader/writer locks scale very well is when hold times are lengthy, as will be shown in my upcoming post, even moderately common writes will result in chewing up a whole lot of wasted CPU time due to (3). If there’s interest, I will look into implementing a variant of this type that uses events for waiting. Clearly this would sacrifice (2).
Some design decisions have been made in the name of keeping this thing lightweight:
- No thread affinity will be used.
- And therefore no recursive acquires will be allowed.
The full code is below, at the bottom of this post. But let’s review the details one-by-one.
First, all state is packed into a single field, m_state. We’ll use the 32nd bit to represent whether the write lock is held, and we’ll use the 31st bit to represent whether a writer is attempting to acquire the lock. As with most reader/writer locks, we will give writers priority over readers because they are supposed to be very infrequent. In other words, once a writer arrives, no more read lock acquires will be permitted. The remaining 30 bits will be used to store the reader count. Some masks make this convenient:
private volatile int m_state;
private const int MASK_WRITER_BIT = unchecked((int)0x80000000);
private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000);
private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT));
private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);
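To see how a packed state word decomposes, here is a small decoding sketch. StateBits and its members are hypothetical helpers, not part of the lock itself; the constants repeat the mask values above so that the snippet stands alone.

```csharp
using System;

public static class StateBits {
    // Same values as the masks in the text, repeated so the sketch stands alone.
    public const int WriterBit = unchecked((int)0x80000000);
    public const int WriterWaitingBit = 0x40000000;
    public const int WriterBits = WriterBit | WriterWaitingBit;
    public const int ReaderBits = ~WriterBits;

    public static bool WriterHeld(int state) { return (state & WriterBit) != 0; }
    public static bool WriterWaiting(int state) { return (state & WriterWaitingBit) != 0; }
    public static int ReaderCount(int state) { return state & ReaderBits; }

    // Renders the three logical fields packed into one state word.
    public static string Describe(int state) {
        return string.Format("writer={0} waiting={1} readers={2}",
            WriterHeld(state), WriterWaiting(state), ReaderCount(state));
    }
}
```

For instance, a state of 0x40000003 decodes to three readers inside the lock with a writer waiting, and StateBits.Describe reports it as "writer=False waiting=True readers=3".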
Now we can write the four methods: EnterWriteLock, ExitWriteLock, EnterReadLock, ExitReadLock.
Entering the write lock merely entails setting m_state to MASK_WRITER_BIT, provided that we see it available. If it’s not available, we’ll just go ahead and try to set the MASK_WRITER_WAITING_BIT to prevent subsequent read locks from being acquired until we get in. We then go ahead and spin until the lock is available using the new type SpinWait in .NET 4.0, checking the m_state field over and over again. The lock is available if m_state is 0 or MASK_WRITER_WAITING_BIT:
public void EnterWriteLock() {
    SpinWait sw = new SpinWait();
    do {
        // If there are no readers currently, grab the write lock.
        int state = m_state;
        if ((state == 0 || state == MASK_WRITER_WAITING_BIT) &&
            Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state)
            return;
        // Otherwise, if the writer waiting bit is unset, set it. We don't
        // care if we fail -- we'll have to try again the next time around.
        if ((state & MASK_WRITER_WAITING_BIT) == 0)
            Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);
        sw.SpinOnce();
    } while (true);
}
Leaving the write lock is actually quite simple. We just set the m_state field to 0, preserving the MASK_WRITER_WAITING_BIT just in case another writer has arrived since we acquired the lock. We use an Interlocked.Exchange (XCHG) operation for this, although we technically could have just done an ordinary write, provided doing so wouldn’t cause memory model or availability problems:
public void ExitWriteLock() {
    // Exiting the write lock is simple: just set the state to 0. We
    // try to keep the writer waiting bit to prevent readers from getting
    // in -- but don't want to resort to a CAS, so we may lose one.
    Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT));
}
Entering the read lock is even more straightforward. The lock is available for readers when m_state & MASK_WRITER_BITS is 0. In other words, no writer holds the lock and no writer is waiting for the lock. Once we see the lock in such a state, we merely try to add one to the state value and CAS it in. In this way, m_state & MASK_READER_BITS will be equal to the number of concurrent readers in the lock:
public void EnterReadLock() {
    SpinWait sw = new SpinWait();
    do {
        int state = m_state;
        if ((state & MASK_WRITER_BITS) == 0) {
            if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state)
                return;
        }
        sw.SpinOnce();
    } while (true);
}
Lastly, exiting the read lock is the most complicated operation of all. It needs to decrement the reader count, while at the same time preserving the MASK_WRITER_WAITING_BIT:
public void ExitReadLock() {
    SpinWait sw = new SpinWait();
    do {
        // Validate we hold a read lock.
        int state = m_state;
        if ((state & MASK_READER_BITS) == 0)
            throw new Exception("Cannot exit read lock when there are no readers");
        // Try to exit the read lock, preserving the writer waiting bit (if any).
        if (Interlocked.CompareExchange(
                ref m_state,
                ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT),
                state) == state)
            return;
        sw.SpinOnce();
    } while (true);
}
And that’s it.
Here are some single-threaded performance numbers, comparing the relative costs of several locks out there. These are taken from a large number of acquire/release pairs, i.e., ‘for (int i = 0; i < N; i++) { lock.Enter(); lock.Exit(); }’, for a very large value of N:
Monitor                    0004487479
RWL read lock (legacy)     0023042785   5.13491x
RWL write lock (legacy)    0023118085   5.15169x
SlimRWL read lock (3.5)    0009423579   2.099976x
SlimRWL write lock (3.5)   0008680855   1.934465x
Vance read lock            0004923609   1.097193x
Vance write lock           0004802136   1.070123x
SpinRWL read lock          0004298525   0.9579604x
SpinRWL write lock         0003819024   0.8510431x
The Nx ratios compare the lock in question to Monitor as our baseline. Smaller is better. As you can see, we seem to be on pretty solid ground to start with. But clearly the most interesting part of this whole thing is the scaling numbers--in particular whether read-mode helps with throughput--both for the existing reader/writer locks and our new one. The results may surprise you. That’s coming in the next post...
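A measurement of this shape might be sketched as follows. This is not the harness that produced the numbers above: LockBench and its members are hypothetical, the units are Stopwatch ticks, and the delegate calls add overhead of their own, so the absolute values will differ from the table even though the ratios are computed the same way.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class LockBench {
    // Times n uncontended enter/exit pairs and returns elapsed Stopwatch ticks.
    public static long Time(int n, Action enter, Action exit) {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) {
            enter();
            exit();
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Cost of a candidate relative to the baseline; below 1.0 means cheaper.
    public static double Ratio(long candidateTicks, long baselineTicks) {
        return (double)candidateTicks / baselineTicks;
    }

    // Compares ReaderWriterLockSlim's read mode against Monitor on one thread.
    public static double SlimReadVersusMonitor(int n) {
        object gate = new object();
        ReaderWriterLockSlim slim = new ReaderWriterLockSlim();
        long monitor = Time(n, () => Monitor.Enter(gate), () => Monitor.Exit(gate));
        long read = Time(n, slim.EnterReadLock, slim.ExitReadLock);
        return Ratio(read, monitor);
    }
}
```

A call such as LockBench.SlimReadVersusMonitor(10000000) returns a single ratio comparable in meaning to the "Nx" column; running it several times and discarding the first (warm-up) run is a sensible precaution.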
(Here is the full listing.)
using System;
// We use plenty of interlocked operations on volatile fields below. Safe.
#pragma warning disable 0420
namespace System.Threading {
    /// <summary>
    /// A very lightweight reader/writer lock. It uses a single word of memory, and
    /// only spins when contention arises (no events are necessary).
    /// </summary>
    public struct ReaderWriterSpinLock {
        private volatile int m_state;
        private const int MASK_WRITER_BIT = unchecked((int)0x80000000);
        private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000);
        private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT));
        private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);
        public void EnterWriteLock() {
            SpinWait sw = new SpinWait();
            do {
                // If there are no readers currently, grab the write lock.
                int state = m_state;
                if ((state == 0 || state == MASK_WRITER_WAITING_BIT) &&
                    Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state)
                    return;
                // Otherwise, if the writer waiting bit is unset, set it. We don't
                // care if we fail -- we'll have to try again the next time around.
                if ((state & MASK_WRITER_WAITING_BIT) == 0)
                    Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);
                sw.SpinOnce();
            } while (true);
        }
        public void ExitWriteLock() {
            // Exiting the write lock is simple: just set the state to 0. We
            // try to keep the writer waiting bit to prevent readers from getting
            // in -- but don't want to resort to a CAS, so we may lose one.
            Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT));
        }
        public void EnterReadLock() {
            SpinWait sw = new SpinWait();
            do {
                int state = m_state;
                if ((state & MASK_WRITER_BITS) == 0) {
                    if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state)
                        return;
                }
                sw.SpinOnce();
            } while (true);
        }
        public void ExitReadLock() {
            SpinWait sw = new SpinWait();
            do {
                // Validate we hold a read lock.
                int state = m_state;
                if ((state & MASK_READER_BITS) == 0)
                    throw new Exception("Cannot exit read lock when there are no readers");
                // Try to exit the read lock, preserving the writer waiting bit (if any).
                if (Interlocked.CompareExchange(
                        ref m_state,
                        ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT),
                        state) == state)
                    return;
                sw.SpinOnce();
            } while (true);
        }
    }
}
 Friday, January 16, 2009
I just uploaded a free sample chapter for my Concurrent Programming on Windows book:
2 Synchronization and Time
STATE IS AN important part of any computer system. This point seems so obvious that it sounds silly to say it explicitly. But state within even a single computer program is seldom a simple thing, and, in fact, is often scattered throughout the program, involving complex interrelationships and different components responsible for managing state transitions, persistence, and so on. Some of this state may reside inside a process’s memory—whether that means memory allocated dynamically in the heap (e.g., objects) or on thread stacks—as well as files on-disk, data stored remotely in database systems, spread across one or more remote systems accessed over a network, and so on. The relationships between related parts may be protected by transactions, handcrafted semitransactional systems, or nothing at all.
The broad problems associated with state management, such as keeping all sources of state in-synch, and architecting consistency and recoverability plans all grow in complexity as the system itself grows and are all traditionally very tricky problems. If one part of the system fails, either state must have been protected so as to avoid corruption entirely (which is generally not possible) or some means of recovering from a known safe point must be put into place.
While state management is primarily outside of the scope of this book, state “in-the-small” is fundamental to building concurrent programs. Most Windows systems are built with a strong dependency on shared memory due to the way in which many threads inside a process share access to the same virtual memory address space. The introduction of concurrent access to such state introduces some tough challenges. With concurrency, many parts of the program may simultaneously try to read or write to the same shared memory locations, which, if left uncontrolled, will quickly wreak havoc. This is due to a fundamental concurrency problem called a data race or often just race condition. Because such things manifest only during certain interactions between concurrent parts of the system, it’s all too easy to be given a false sense of security—that the possibility of havoc does not exist.
In this chapter, we’ll take a look at state and synchronization at a fairly high level. We’ll review the three general approaches to managing state in a concurrent system:
- Isolation, ensuring each concurrent part of the system has its own copy of state,
- Immutability, meaning that shared state is read-only and never modified, and
- Synchronization, which ensures multiple concurrent parts that wish to access the same shared state simultaneously cooperate to do so in a safe way.
We won’t explore the real mechanisms offered by Windows and the .NET Framework yet. The aim is to understand the fundamental principles first, leaving many important details for subsequent chapters, though pseudo-code will be used often for illustration.
We also will look at the relationship between state, control flow, and the impact on coordination among concurrent threads in this chapter. This brings about a different kind of synchronization that helps to coordinate state dependencies between threads. This usually requires some form of waiting and notification. We use the term control synchronization to differentiate this from the kind of synchronization described above, which we will term data synchronization.
Read more here...
Related, I was recently interviewed by DZone about the book. You can read my responses here. Enjoy.
 Monday, January 12, 2009
I received some feedback on my previous post, Some performance implications of CAS operations, indicating that a few clarifications are in order. If I had to summarize the intended conclusion, it’d go something like this:
Sharing is evil, fundamentally limits scalability, and is best avoided.
I have to admit that the post was meant to focus more on concrete data, since I expected the meta-point about sharing to be implied. I figured folks would pick up on the link: (i) Sharing memory requires concurrency control, (ii) Concurrency control requires CAS, (iii) CAS is expensive, therefore (iv) Sharing memory is expensive. Many people simply don’t understand how crippling CAS can be when placed in a hot path, and I wanted to point out some (albeit extreme) examples of this point.
I did have a motivation for the post. A lot of people point at lock-free techniques, software transactional memory, reader/writer locks, etc. as ways to improve scalability. Sadly this seldom pans out. Each involves CASs of some sort, and, assuming the lock-based equivalent is written properly (that is, to hold locks for very short periods of time), the alternative can in fact often fare worse. I call this game “count the CASs.” It’s the roundtrips back to shared memory, failed optimistic attempts, cache invalidations, and line ping-ponging that kill you.
Some might accuse me of unfairly targeting CAS. That’s hogwash. I’ve been in the trenches for years writing and optimizing systems-level parallel code on Windows. A parallel for loop can go from scaling perfectly to not scaling at all if you choose the wrong granularity for the loop counter increments. And vice versa. Why? Because the frequency of CASs will bring the memory system to its knees. You simply must consider these kinds of things when developing your data structures and algorithms; easing pressure on the cache hierarchy is the only way to scale beyond a handful of processors.
The sad truth is that only radical changes to the way we write software will allow fine-grained parallelism to scale to the numbers we expect in the 5-year time horizon. Hiding more and more conveniently inserted CAS operations auto-magically for folks is not doing them any good. A mostly functional style combined with concurrency-safe mutation on guaranteed-isolated object graphs is, in my opinion, the only path forward.
 Friday, January 09, 2009
Along with type systems, I'm casually interested in formal specifications and verification of software. During lunch today, I watched an internal Microsoft Research talk given by Leslie Lamport. The topic was TLA+ -- his formal verification system -- during which he blurted out a couple of amusing quotes:
"Writing is nature's way of letting you know how sloppy your thinking is." --- Guindon (cartoon)
"Math is nature's way of letting you know how sloppy your writing is." --- Leslie Lamport (riffing on Guindon)
And related:
"Formal math is nature's way of letting you know how sloppy your math is." --- Leslie Lamport
They made me chuckle out loud, so I figured I'd share them. Unfortunately the talk isn't available outside the company (as far as I can tell), but Lamport has written a book, Specifying Systems, available online, in addition to dozens of interesting papers, on the topic.
 Thursday, January 08, 2009
CAS operations kill scalability.
(“CAS” means compare-and-swap. This is the term most commonly used in academic literature, but it is commonly referred to under many guises. Windows has historically called it an “interlocked” operation and offers a bunch of such-named Win32 APIs; .NET does the same. This set entails X86 instructions like XCHG, CMPXCHG, and certain instructions prefixed with LOCK, such as INC, ADD, and so on.)
My opening statement is a bit extreme, but it’s true enough. There are several reasons:
0. CAS relies on support in the hardware to ensure atomicity. Namely, most Intel and AMD architectures use a MESI-family cache coherency protocol (such as MOESI) to manage cache lines. In such an architecture, CAS operations on uncontended lines that are owned exclusively (E) within a processor’s cache are relatively cheap. But any contention – false or otherwise – leads to invalidations and bus traffic. The more invalidations, the more saturated the bus, and the greater the latency for CAS completion. Cache contention is a scalability killer for non-CAS memory operations too, but the need to acquire a line exclusively makes matters doubly worse when CAS is involved.
1. CAS costs more than ordinary memory operations, in CPU cycles. This is due to the additional burden on the cache hierarchy, and also because of requirements around flushing write buffers, restrictions on speculation across the fences, and impact to a compiler’s ability to optimize around the CAS.
2. CAS is often used in optimistically concurrent operations. That means a failed CAS will lead to a retry of some sort – typically with some kind of backoff – which is purely wasted work that isn’t present when there isn’t any contention. And 0 and 1 both increase the risk of contention.
The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.
As a simple illustration, imagine we’d like to increment a counter 100,000,000 times. There are a few ways we could do this. If we’re just running on a single CPU, we can use ordinary memory operations:
Variant #0:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++)
    s_counter++;
This clearly isn’t threadsafe, but provides a good baseline for the cost of incrementing a counter. The first way we might make it threadsafe is by using a LOCK INC:
Variant #1:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++)
    Interlocked.Increment(ref s_counter);
This is now threadsafe. An alternative way of doing this – commonly needed if we must perform some kind of validation (like overflow prevention) – is to use a CMPXCHG:
Variant #2:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++) {
    int tmp;
    do {
        tmp = s_counter;
    } while (Interlocked.CompareExchange(ref s_counter, tmp+1, tmp) != tmp);
}
An interesting question to ask now is: How much slower will each variant be when cache contention is introduced? In other words, run a copy of each code on P separate processors, incrementing the same s_counter variable by N/P, and compare the running times for different values of P, including 1. You might be surprised by the results.
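The experiment described above can be sketched roughly as follows. This is a minimal harness and not the code used to produce the numbers in this post: the class name ContendedCounter, the Run helper, and the smaller iteration count are all assumptions made for illustration, and threads are not affinitized.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ContendedCounter
{
    // The single shared location that every thread fights over.
    private static int s_counter;

    // Variant #1: one LOCK-prefixed increment per iteration.
    public static void IncrementInterlocked(int iterations)
    {
        for (int i = 0; i < iterations; i++)
            Interlocked.Increment(ref s_counter);
    }

    // Variant #2: read, then retry a compare-exchange until it succeeds.
    public static void IncrementCompareExchange(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            int tmp;
            do { tmp = Volatile.Read(ref s_counter); }
            while (Interlocked.CompareExchange(ref s_counter, tmp + 1, tmp) != tmp);
        }
    }

    // Runs body on p threads, n / p iterations each, and reports the
    // final counter value together with the elapsed wall-clock time.
    public static (int Count, TimeSpan Elapsed) Run(Action<int> body, int n, int p)
    {
        s_counter = 0;
        var threads = new Thread[p];
        for (int t = 0; t < p; t++)
            threads[t] = new Thread(() => body(n / p));

        var sw = Stopwatch.StartNew();
        foreach (var th in threads) th.Start();
        foreach (var th in threads) th.Join();
        sw.Stop();
        return (s_counter, sw.Elapsed);
    }

    public static void Main()
    {
        const int N = 10000000; // smaller than the 100,000,000 used in the post
        foreach (int p in new[] { 1, 2, 4 })
        {
            var r1 = Run(IncrementInterlocked, N, p);
            var r2 = Run(IncrementCompareExchange, N, p);
            Console.WriteLine(
                $"P = {p}: #1 {r1.Elapsed.TotalMilliseconds:F0} ms, #2 {r2.Elapsed.TotalMilliseconds:F0} ms");
        }
    }
}
```

Dividing each elapsed time by the P = 1 time of the plain ++ loop would give normalized figures in the same spirit as those reported here, though the absolute values will differ from machine to machine.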
For example, on one of my dual-processor/dual-core (that’s 4-way) Intel machines, the results are as follows. I’ve run Variant #0 even though it’s not threadsafe, simply because it shows the effects of cache contention on ordinary memory loads and stores.
         Variant #0   Variant #1   Variant #2
P = 1    1.00X        4.73X        5.38X
P = 2    2.11X        10.74X       16.70X
P = 4    3.87X        7.57X        73.35X
All numbers are normalized and compared to the ++ code on a single processor. In other words, Variant #0 run on 2 processors is 2.11X the cost of Variant #0 run on 1 processor; similarly, Variant #0 run on 4 processors is 3.87X the cost of Variant #0 run on 1 processor. Variant #1 gets even worse at 4.73X, 10.74X, and 7.57X, respectively. And Variant #2 explodes in cost as more contention is added, going from 5.38X, to 16.70X, to a whopping 73.35X. Adding more concurrency actually makes things substantially worse.
(The absolute numbers are not to be trusted, and there are anomalies undoubtedly introduced based on how threads are scheduled; I’ve not affinitized them, so they may end up sharing sockets at will. A more scientific experiment needs to consider such things.)
The CMPXCHG example (Variant #2) can be improved by strategic spinning when a CAS fails. Part of what makes the numbers so bad – particularly the P = 4 case – is the amount of lost time due to livelock and the associated memory system interference.
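One way to add that kind of spinning is sketched below, using .NET's System.Threading.SpinWait to back off after each failed attempt. The BackoffCounter class is a hypothetical illustration and was not part of the measurements in this post; whether it helps, and by how much, depends heavily on the machine and the degree of contention.

```csharp
using System;
using System.Threading;

public sealed class BackoffCounter
{
    private int _value;

    public int Value => Volatile.Read(ref _value);

    // CAS-based increment that spins briefly after a failed attempt
    // instead of immediately hammering the contended cache line again.
    public void Increment()
    {
        var spinner = new SpinWait();
        while (true)
        {
            int tmp = Volatile.Read(ref _value);
            if (Interlocked.CompareExchange(ref _value, tmp + 1, tmp) == tmp)
                return;
            spinner.SpinOnce(); // waits progressively longer on repeated failures
        }
    }

    public static void Main()
    {
        var counter = new BackoffCounter();
        var threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 100000; i++) counter.Increment();
            });
        foreach (var th in threads) th.Start();
        foreach (var th in threads) th.Join();
        Console.WriteLine(counter.Value); // 4 threads x 100000 increments
    }
}
```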
This is an extreme example. Few workloads sit in a loop modifying the same location in memory over and over and over again. Even if they do – as in the case of a parallel for loop in which all threads fight to increment the shared “current index” variable – these accesses are ordinarily broken apart by sizeable delays during which useful work is done. Augmenting the test to delay accessing the shared location by a certain number of function calls certainly relieves pressure.
For example, here are the numbers if we add a 2-function-call delay in between accesses:
         Variant #0   Variant #1   Variant #2
P = 1    1.00X        2.54X        2.77X
P = 2    1.47X        5.19X        8.59X
P = 4    2.78X        3.67X        26.55X
And if we add a 64-function-call delay in between accesses, the micro-cost between the three variants doesn’t matter much. But the contention behavior sure is different. And we can even find some cases where the multithreaded variants run faster than the single-threaded counterpart:
         Variant #0   Variant #1   Variant #2
P = 1    1.00X        1.00X        1.00X
P = 2    0.59X        0.74X        0.85X
P = 4    0.51X        0.45X        1.23X
This is the first time we have seen a number < 1.00X. That's a speedup; remember, we are using parallelism after all.
As you might guess, in the region between 2 and 64 function calls the results gradually get better and better; and beyond 64, they get substantially better. In fact, when we insert 128 function calls in between, we get very close to perfect, linear scaling for all 3 variants:
         Variant #0   Variant #1   Variant #2
P = 1    1.00X        1.00X        1.00X
P = 2    0.50X        0.52X        0.52X
P = 4    0.30X        0.29X        0.27X
(As a reminder, 0.50X is a perfect speedup on a 2-CPU machine, and 0.25X is a perfect speedup on a 4-CPU machine.)
The moral of this story is that nothing is free, and CAS is certainly no exception. You should be extremely stingy about adding CAS operations to your code, and conscious of the frequency at which threads will perform them. The same is generally true of all memory access patterns when parallelism is in play, but particularly for expensive operations like CAS.
And even if you’re not using CASs directly in your code, you may be using them via some system service. Parallel Extensions uses them in many ways. For instance, when you’re doing a Parallel.For loop, we internally share a counter that is accessed by multiple threads. So even if your algorithm is theoretically embarrassingly parallel, the internal counter management could get in your way. We try to be intelligent by chunking up indices, but we aren’t perfect: if you have very small loop bodies the overhead of CAS could begin to impact scalability. You can work around this by making loop bodies more chunky; one example of how is by doing your own partitioning on top of our library (like executing multiple loop iterations inside the body passed to Parallel.For). Even things like allocating memory with the CLR’s workstation GC require the occasional roundtrip to reserve a thread-local allocation context by issuing a CAS operation against a shared memory location.
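As a rough sketch of the "do your own partitioning" workaround, the code below hands Parallel.For a range of chunk numbers rather than individual indices, runs an ordinary sequential loop inside each chunk, and publishes one result per chunk. The ChunkedLoop class, the Sum method, and the chunk size are assumptions chosen for illustration; a good chunk size depends on how much work each iteration does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkedLoop
{
    // Sums the array by iterating over chunks in parallel. Each chunk is
    // processed with plain loads and adds; shared state is touched only
    // once per chunk. chunkSize must be greater than zero.
    public static long Sum(int[] data, int chunkSize)
    {
        int chunkCount = (data.Length + chunkSize - 1) / chunkSize;
        long total = 0;

        Parallel.For(0, chunkCount, chunk =>
        {
            int start = chunk * chunkSize;
            int end = Math.Min(start + chunkSize, data.Length);

            long local = 0;
            for (int i = start; i < end; i++)
                local += data[i];

            // One interlocked operation per chunk instead of one per element.
            Interlocked.Add(ref total, local);
        });

        return total;
    }

    public static void Main()
    {
        var data = new int[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;
        Console.WriteLine(Sum(data, 4096)); // one million ones
    }
}
```

The same idea applies to the loop counter itself: fewer, larger units of work mean fewer trips to the shared index.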
 Sunday, December 28, 2008
As embarrassing as it is, the errata for Concurrent Programming on Windows is non-empty.
I've posted an initial listing -- full of primarily simple typos like misplaced commas -- at http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html#Errata.
Sincere thanks to everybody who has reported errors thus far. If you find any additional ones, please email them to me directly: joe AT bluebytesoftware DOT com. We'll attempt to fix as many errors as possible in subsequent printings of the 1st edition and, if that fails, they'll make the 2nd edition.
I've spent the past few months (from September onward) travelling approximately 75% of the time. As a result, I may be slow responding to email concerning the book. I've also not finished putting the code samples up for download; my current ETA for that is mid-January 2009. I already know there are a few more errata entries lurking within, due to some last minute typographical updates made late in the editing process. If only word processing software came complete with built-in compilers... (excuses, excuses)
In any case, I'd love to receive feedback on the book. Even if it's not about an error. Things you like, things you'd like to see improved, things you wish I'd not written about, requests for clarifications, etc. Just drop me an email. Cheers.
 Saturday, November 29, 2008
I've had an obsession with programming languages for some time now. This probably began the first time I learned of LISP. Most people I know have had a similar "Ah-Hah!" moment associated with LISP, but it was when I first truly realized the deep extent to which a programming language shapes thought -- sometimes in negative ways. LISP put it all into perspective.
Since then, the obsession has only become worse through my employment at Microsoft, where I've had the privilege to work alongside and interact with some of the greatest minds in programming languages. This is an absolute honor. I worked on a few compilers and did some language design, particularly when on the Common Language Runtime team, and my favorite project today is my work on type-system support for static enforcement of concurrency safety and guaranteed isolation. I have found great joy in applying underlying concepts in more niche (and extreme) languages like Haskell to more mainstream languages like C#. My favorite pastime is tracing back the lineage of languages to their earliest ideas, especially when this leads to the unearthing of a subtle commonality among them. I have been designing one of my own and, while it is undoubtedly a 5-year project that may never see the light of day, I do it for the love of languages.
This book has been stewing inside me for a while now. And after seeing Guy L. Steele and Richard P. Gabriel's infinitely beautiful "50 in 50" presentation at JAOO this year, I decided it was time for it to escape.
Notation and Thought: Behind Computer Science's Most Influential Programming Languages
“That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.” --- George Boole, Laws of Thought
Programming languages are not only a notation for expression, but also a medium of thought, akin to the duality between natural written and spoken languages. If you can think it, you can create it. The reverse is also true: if a language poses impediments to your thought process, certain solutions to problems are simply unfathomable. Languages are therefore not just what you see “on paper”--each is a unique tool that can substantially limit, or expand, the creative freedom of the programmer in whose hands it sits. Good languages get out of the way, and great ones do a whole lot more.
In the early days, there was of course nothing that resembled modern day languages. Computers had to be told what to do in excruciating detail. One only has to look at modern day assembly language to see that programming a computer in this manner constrains creativity and slows progress. Alan Turing didn’t even have that when he wrote his classic On Computable Numbers with an Application to the Entscheidungsproblem paper, but he at least managed to solve some simple problems: by moving a tape reader and reading and writing symbols, he was able to create the modern day equivalent to subroutines and even add up a number or two. But our industry would have never seen radical advances in enabling technologies, and widespread computer use, that we enjoy today without significant advances in higher-order abstractions.
Plankalkül, or the plan calculus, is widely recognized as the first real programming language. It was designed by a German computer engineer, Konrad Zuse, and first written down in an unpublished manuscript in 1943. The language offered composite (albeit simplistic) data structures, arrays, named variables, subroutines, and moderately sophisticated control flow and looping constructs. Although it was never used in practice, Plankalkül was surprisingly ahead of its time. It was a big step towards more abstract problem solving.
It should be no surprise that subsequent programming languages are as varied in their design as the humans that created them. This fact can be seen by examining the ensuing decade of computing post-Plankalkül. The 1950s saw the invention of four new major languages that fundamentally shaped the future of language design. FORTRAN, or the FORmula TRANslation language, specialized in describing transformations on data and numerics, and was the first non-assembly language to reach widespread use in performance sensitive situations. LISP, or the LISt Processing language, was developed for symbolic processing and, eventually, found a home in artificial intelligence, pioneering many techniques that are still in use today such as first class functions as data, a recursive style of programming, and garbage collection. Its principles were derived from the mathematical logics of Alonzo Church and Haskell B. Curry, notably Church’s lambda calculus from the 1930s. ALGOL, or the ALGOrithmic Language, focused on describing algorithms elegantly, kick-started the imperative family of languages (of which many popular industry programming languages like C++ and Java are members), and later set the de facto standard style for Computer Science education curricula. Its method of encoding algorithms with assignments was far closer to the von Neumann architecture than was LISP, making the resulting programs behave predictably and efficiently. Lastly, COBOL, or the COmmon Business-Oriented Language, became the first domain-specific language (DSL) that targeted non-programming business and finance experts, broadening the general accessibility of computers. Each of the four has had a crucially important role to play in the history of programming languages.
There has been no shortage of language diversity after the birth of the initial four. In fact, hundreds of languages have since come and gone, some enjoying brief or extended periods of popular use. All that have since come have been deeply influenced by the pioneers, but have also contributed a handful of innovative new ideas that help programmers more clearly think about and express solutions to real-world problems. The lineage of languages has branched off into separately named family trees--such as imperative, functional, logic, declarative and domain-specific--only to reunite intimately with each other down the line. Indeed, it really is just one big happy family.
This book traces this lineage through the most influential languages--those that have deeply impacted the way that programmers think and write--and provides insight into the motivation behind them, their major influences, and the important features that each language contributed. Throughout, it is my hope to develop within the reader a new appreciation of the art of programming computers, an understanding of the impact that language has on our thinking, and an excitement about the future of language design that lies ahead.
Joe Duffy November, 2008
 Tuesday, November 04, 2008
Type classes, kinds, and higher-order polymorphism represent some of Haskell’s most distinctive and important contributions to the world of programming languages. They are all related, and began life as type classes in Wadler and Blott’s 1988 paper, How to make ad-hoc polymorphism less ad hoc. Later, Jones introduced the (then separate) concept of constructor classes in his 1993 paper, A system of constructor classes: overloading and implicit higher-order polymorphism. Eventually these two ideas were unified into a beautiful single set of features (namely, type constructors and kinds) in Haskell.
In this short essay, I’ll explain what these things are and why I’m sad that we don’t have them in C#.
To take the simplest motivating example, say we want to define a generic square function:
square x = x * x
Given a Hindley-Milner type system (with type inference), how should the compiler type this function? The challenge that immediately arises is that, to know the type of x and the function’s return value, we must know something about the function * being called within the body of square. But to know something about that function, we’d need to know the type of x. We’ve entered into a cycle, and have hit a wall. Clearly the type will be something generic, but polymorphic on what?
Imagine that we could infer the type of the * function as follows:
(*) :: a -> a -> b
In other words, * is a function that takes two values, both of type a, and produces some value of another type b. We know its two arguments must be of the same type because in square we pass the same value x to it twice. Given this typing for *, we could then type square similarly as:
square :: a -> b
In other words, square takes a single value of type a and produces a value of type b. The constraint on the type a here is, of course, that some function * is available that is typed as taking an a as input. There’s no obvious way to capture this in the type system, though we might conceive of something like:
square :: (* :: a -> a -> b) => a -> b
In other words, given a type a for which some function * is defined, which takes two a’s and returns a single b, the type of square thus takes an a and produces a b. You can’t say that in Haskell, although we’ll see a bit later that type classes allow similar constraints (with “=>”) to be written.
While this hypothetical typing is extremely general purpose, it would produce considerable challenges in its implementation. Standard ML throws up its hands and defaults overloaded arithmetic operators (like *) to int when nothing else pins down their type, meaning that all of the types above (both a and b) will be inferred as int: (*) takes two ints and returns an int, and square is of type int -> int. F# likewise assumes you’re working with ints. Both Standard ML and F# have amazingly rich type inference systems, but this begins to run right up against the limits of what they can do. We’ll see some harder examples shortly.
You can probably guess that Haskell’s solution to this conundrum is to use higher order polymorphism with a feature of its type system called type classes. They allow us to classify types much in the same way types ordinarily classify objects. We can classify the set of numeric types as follows, for instance:
class Num a where
    (*) :: a -> a -> a
    … other numeric operations …
And then we can go ahead and provide concrete mappings for integers and floating point numbers:
instance Num Int where
    (*) = mulInt
    …
instance Num Float where
    (*) = mulFloat
Each instance of the type class (in this case, Num) is a bit like a dictionary mapping the named functions (in this case, just *) to other functions that are defined for the concrete type (in this case, supplied in a’s stead). With this information defined, the Haskell compiler can now infer the type of square as:
square :: Num a => a -> a
This inference really just says that the function square is defined for all types a that are in the type class Num. The “Num a =>” part is a bit like a C# generic type constraint, in that it restricts what kinds of a’s can be supplied. Given what has been stated thus far, that’s just Int and Float. So we can only call the square function with types on which multiplication is properly defined, which is exactly what we want.
At this point, we might want to try defining a similar thing in C# using generics. (And for this simplistic example, and others like Haskell’s Eq a type class, we will succeed.) There are two basic ways we could achieve this. The first is to define an INum<T> interface (or abstract class—pick your poison), and give it an instance method to multiply the target with another number:
interface INum<T> {
    T Mult(T x);
}
We would then have the basic numeric data types like Int32 and Float implement INum<T>:
struct Int32 : INum<Int32> {
    public Int32 Mult(Int32 x) { return value * x; }
    …
}
struct Float : INum<Float> {
    public Float Mult(Float x) { return value * x; }
    …
}
Given these definitions, it would be a breeze to write a Square method that only operates on INum<T>s:
T Square<T>(T x) where T : INum<T> { return x.Mult(x); }
Thankfully, we can recursively reference the T from within the generic type constraint.
Now, of course, there’s no way the C# compiler would infer the necessary INum<T> constraint. But given that we don’t have rich type inference (aside from for local variables) in C#, this doesn’t pose any new problems. Another slight annoyance is that you need to modify the source type to declare support for INum<T>, when a perfectly reasonable implementation could have been provided “from the outside,” but you’ll find that this will only occasionally get under your skin.
The second way we might go about this is to take an approach similar to .NET’s EqualityComparer<T> class, where we have an abstract base class that represents the ability to do something with instances of Ts. And then we only provide implementations on concrete Ts for which that ability makes sense. For example, we could have a Multiplier<T> that looks a lot like INum<T>:
abstract class Multiplier<T> {
    public abstract T Mult(T x, T y);
}
Multiplier<T> on its own isn’t usable. But we can provide implementations for Int32 and Float:
class Int32Multiplier : Multiplier<Int32> {
    public override Int32 Mult(Int32 x, Int32 y) { return x * y; }
}
class FloatMultiplier : Multiplier<Float> {
    public override Float Mult(Float x, Float y) { return x * y; }
}
// And so on …
Now we can write a slightly different Square method that takes a Multiplier<T> as an extra argument:
T Square<T>(T x, Multiplier<T> m) { return m.Mult(x, x); }
Now there isn’t any kind of generic type constraint on Square’s T, but of course we can only call it if we have a concrete instance of Multiplier<T> in hand. And by definition that means there is a Mult method defined that we can call. (This isn’t wholeheartedly true. You can of course call Square<U> for any U, passing in null as the second argument. But presumably the method would check for null and throw. This is a real limitation, however, which would likely push us back in the direction of the original interface solution. If we had non-null types, we could get closer to a fully statically verifiable solution.)
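For concreteness, here is a compilable sketch of this second approach against the real built-in types. IntMultiplier, DoubleMultiplier, and the Numeric holder class are hypothetical names introduced for this example, and the explicit null check is the runtime stand-in for the static guarantee we can't get.

```csharp
using System;

public abstract class Multiplier<T>
{
    public abstract T Mult(T x, T y);
}

public sealed class IntMultiplier : Multiplier<int>
{
    public override int Mult(int x, int y) => x * y;
}

public sealed class DoubleMultiplier : Multiplier<double>
{
    public override double Mult(double x, double y) => x * y;
}

public static class Numeric
{
    // The "dictionary" of operations is passed explicitly by the caller,
    // whereas a Haskell compiler would supply the Num instance implicitly.
    public static T Square<T>(T x, Multiplier<T> m)
    {
        if (m == null) throw new ArgumentNullException(nameof(m));
        return m.Mult(x, x);
    }

    public static void Main()
    {
        Console.WriteLine(Square(7, new IntMultiplier()));      // 7 * 7
        Console.WriteLine(Square(1.5, new DoubleMultiplier())); // 1.5 * 1.5
    }
}
```

Note that the caller has to pick and pass the right multiplier at every call site, which is exactly the bookkeeping that type classes take off your hands.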
Aside from a lot more typing, and the lack of rich type inference, we seem to have reached parity. The simple examples provided in the literature and Haskell’s Standard Prelude can be implemented in such a fashion. But we are kidding ourselves if we think these are the same thing.
The main problem is that C# doesn’t support higher-kinded type parameters. We haven’t yet seen a type class in Haskell that fully exploits this capability, but there are several. The simplest one I know about in the Haskell Standard Prelude is the Functor type class. (Monad is also a great example, but is complicated (and frightening) enough that it will be a topic for another day.) Functor’s definition is:
class Functor f where
    fmap :: (a -> b) -> f a -> f b
The Functor type class offers a single function, fmap. It takes two things—a function that transforms a value of type a into a value of type b and some functor value of type f a—and returns some new functor value of type f b. This looks like an ordinary type class, except for one funny (and subtle) aspect. Functor abstracts over type f, but notice that we’re using f in fmap’s second argument and return type by actually constructing it with two other types a and b! In case you’re having a hard time thinking in Haskell, it’s as though we tried to write this in C# using our interface trick from earlier:
interface IFunctor<T> {
    T<B> FMap<A, B>(Func<A, B> f, T<A> a);
}
This won’t compile. We can’t refer to T in the typing of FMap as T<B> and T<A>: it’s not expressible in C# and .NET’s type system. Let’s pretend for a moment, however, that we could. What is an example of class that might implement this? How about something that deals in terms of Nullable<T> instances?
class NullableFunctor : IFunctor<Nullable<>> {
    public Nullable<B> FMap<A, B>(Func<A, B> f, Nullable<A> a) {
        // Propagate "no value" instead of throwing when a is empty.
        return a.HasValue ? new Nullable<B>(f(a.Value)) : new Nullable<B>();
    }
}
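What does compile in C# today is one mapping function per type constructor, as in the sketch below. Functors, FMapNullable, and FMapList are hypothetical names; the point is that each is written separately, and no generic code can abstract over "any T<> that has an FMap" the way Haskell's Functor does.

```csharp
using System;
using System.Collections.Generic;

public static class Functors
{
    // fmap specialized to Nullable<>: apply f only when a value is present.
    public static B? FMapNullable<A, B>(Func<A, B> f, A? a)
        where A : struct
        where B : struct
    {
        return a.HasValue ? f(a.Value) : (B?)null;
    }

    // fmap specialized to List<>: apply f to every element, preserving order.
    public static List<B> FMapList<A, B>(Func<A, B> f, List<A> xs)
    {
        var result = new List<B>(xs.Count);
        foreach (var x in xs)
            result.Add(f(x));
        return result;
    }

    public static void Main()
    {
        int? some = 41;
        int? none = null;
        Console.WriteLine(FMapNullable<int, int>(x => x + 1, some).HasValue);
        Console.WriteLine(FMapNullable<int, int>(x => x + 1, none).HasValue);
        Console.WriteLine(FMapList<int, string>(x => "#" + x, new List<int> { 1, 2, 3 }).Count);
    }
}
```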
All you need to do is take a close look at a 1997 paper by Simon Peyton Jones, Mark Jones, and Erik Meijer, entitled Type classes: an exploration of the design space, and you will find a plethora of even more complicated (and useful) examples that use an innocent-sounding aspect of Haskell’s type system called multi-parameter type classes. All of the types are higher-order and are merely moved around and manipulated like abstract (higher-order) symbols. The type system gracefully gets out of the way and allows you to drop abstract type parameters into any holes they fit in, without mandating that you say too much. The secret sauce—as noted earlier—is kinds.
Kinds are used in the implementation of Haskell’s type system, and you won’t see them mentioned a whole lot anywhere. They basically categorize what kind of types can appear anywhere a type is expected. A great overview (with plenty of context) can be found in Mark P. Jones’s Functional Programming with Overloading and Higher-Order Polymorphism paper and, of course, the Haskell 98 Report.
Here’s a quick rundown. Kinds appear in one of two forms:
- the symbol * represents a concrete type (a.k.a. a monotype), and,
- if k1 and k2 are kinds, then k1 -> k2 is the kind of types that take a type of kind k1 and return a type of kind k2.
Kinds are formed in many ways: the primitive types (such as Char, Int, Float, Double, etc.) are an example of the former, and are of kind *. They “bottom out.” Type constructors like Maybe, however (the sort of thing that Functor abstracts over), are an example of the latter, and are of kind * -> *. That is, they take a type of kind k1 (the first *) and produce a type of kind k2 (the second *). By giving some concrete type T (*) to Maybe, we get back a Maybe T (also *). A type constructor is therefore a bit like a function mapping one type to another. The function type constructor (->) has a kind of * -> * -> *, because a function type is built from two types: the type of its argument (the first *) and the type of its return value (the second *). These compose, so that you might have (* -> *) -> * -> *. And so on. Thinking about kinds can take a bit of getting used to.
But the really useful thing here is that kinds allow you to write higher-order type constructors like those we have begun to explore above, such as Functors and Monads. I.e., given a type t1 of kind k1 -> k2, and a type t2 of kind k1, then t1 t2 is a type expression of kind k2. This can be applied to the occurrences of f a and f b in Functor’s fmap function. In the class Functor f, f is of kind * -> * and a and b are of kind *, so f a and f b are both of kind *. When a concrete Functor instance is specified, e.g., by substituting T for f, T must itself be of kind * -> *. That is, it still expects another type before bottoming out. And therefore we can substitute some concrete U and V types for a and b, giving T U and T V, both of kind *.
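The substitution just described can be sketched in a few lines. This is only an illustration under stated assumptions: T here is a hypothetical type constructor invented for the example, describe is a hypothetical helper, and Int and String stand in for the concrete U and V.

```haskell
module Main where

-- Hypothetical type constructor of kind * -> *.
data T a = Empty | Full a
  deriving (Show, Eq)

-- Substituting T for f in   fmap :: (a -> b) -> f a -> f b
-- gives                     fmap :: (a -> b) -> T a -> T b
instance Functor T where
  fmap _ Empty    = Empty
  fmap g (Full x) = Full (g x)

-- Substituting the concrete types Int and String for a and b
-- leaves only types of kind *: T Int and T String.
describe :: T Int -> T String
describe = fmap show

main :: IO ()
main = do
  print (describe (Full 7))
  print (describe Empty)
```

Note that the instance is declared for T alone, not for T Int or T String: the class wants something of kind * -> *, and the type system fills in the remaining hole wherever fmap is used.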
Now we’re done. And, as if by magic, it all works.
 Sunday, November 02, 2008
A few months back, while writing my new book, I whipped together a tool to dump information about your processor layout using the GetLogicalProcessorInformation function from C#. You can find the code snippet in Chapter 5, Advanced Threads, of my book. (A developer on the Windows Core OS team, Adam Glass, had also written a similar tool in C++.) I will be posting code to the companion site for my book in the coming weeks, at which point you can easily get your hands on it.
Anyway, I sent the code to Mark Russinovich suggesting it might make a useful SysInternals tool, and he agreed. Now it's up on microsoft.com for download, under the name of Coreinfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx. When run, Coreinfo pretty-prints information about the mapping from cores to sockets, cores to NUMA nodes, and what kinds of caches are shared on the machine. Particularly for somebody like me who is always running code on different kinds of machines -- and given that parallel code performance heavily depends on memory hierarchy -- I've found this tool to be invaluable. Enjoy.
 Friday, October 31, 2008
Dan Grossman invited me to deliver a talk as part of the University of Washington's Computer Science and Engineering Colloquia series. It was recorded and will eventually air on UWTV, but has also been posted online:
Microsoft's Parallel Computing Platform: Applied Research in a Product Setting
The goal of Microsoft's Parallel Computing Platform (PCP) team is to enable the shift to modern, multi- and manycore hardware, by providing a runtime, programming models, libraries, and tools that make it easy for developers to construct correct, efficient, maintainable, and scalable programs through the use of parallelism. In doing so, tens of years of industry research has been combined and applied in a myriad of ways. This talk examines PCP's current progress, explicitly relating it to specific research of the past and present, in addition to surveying future efforts and possible research opportunities.
http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=768
<WMV - streaming, WMV - download, ...>
If you're not aware of the work we're doing in Visual Studio 2010 -- both in .NET 4.0 and C++ -- this talk gives a pretty good overview of all of it. It has a researchy feel to it, with plenty of pointers to interesting prior research that has influenced our work along the way.
 Thursday, October 02, 2008
The word “architect” means different things to different people in the context of software engineering. And it varies wildly depending on the kind of organization you’re in. An architect at a medium-sized IT shop might focus on connecting disparate business systems together at a high level, but without diving down into code. An architect at a startup may be more like a tech lead, checking in code like mad, but also keeping the rest of the team in check. And a software architect at Microsoft can play an even more varied set of roles because the company is so large and the diversity of projects so great.
A colleague and mentor of mine whom I respect greatly says that an architect is the guy (or gal) who is in charge of making those decisions which, if made incorrectly, could sink the project.
There is a lot to be said for this. These decisions are those with the broadest, deepest, and longest lasting impact. The decisions themselves are often made by team members initially, but the architect is responsible for providing constant and rigorous technical oversight. Architects set the high level technical agenda, look ahead several releases, and keep the team on course. They are ultimately to blame if the technical foundation is unsound and/or the final solution fails to meet expectations. Their butt is on the line.
On one hand, an architect is the lead engineer with the most at stake in the project. On the other hand, an architect is more like a member of the project’s board of directors, providing high level guidance and meddling as little as possible (but as much as is necessary) in the day-to-day details.
An architect’s success is measured by what he or she ships to customers, and not by the amazing ideas that were ultimately never realized. This necessarily means an architect’s success is deeply rooted in the team’s culture, work ethic, and ability. He or she needs to work through others to get things done.
There have been some great architects throughout the course of computer science, though they may not have been labeled as such. Linus Torvalds is the architect of Linux, and David Cutler the architect of Windows NT. John Backus was arguably the architect of FORTRAN, Niklaus Wirth the architect of Pascal, Bjarne Stroustrup the architect of C++, James Gosling the architect of Java, and Anders Hejlsberg the architect of C#. Bill Gates was the architect of Microsoft BASIC, and Charles Simonyi the architect of the initial versions of Microsoft Office (Word and Excel). In each case, you can see that the end result is very reflective of one person’s value system and ideas, but took a lot more than just that person to be successful. Each of these people learned to let go of their project just enough that it could achieve the scale that it was meant to achieve, but not so much that the project veered off course. Some projects have multiple architects, but the successful ones usually have one who is really in charge.
Already you can see some subjective opinion being thrown into the mix, and some of it is apt to be controversial. Although the list is not comprehensive, I’ve put together seven guiding principles that I personally aspire to. I’ve certainly not mastered them all, but have always looked up to those people around me who seem to have. Why seven? No reason, really. Over the past few years, I’ve tried to spend as much time as possible learning from successful architects, and these stand out in my mind as the key attributes common among them.
0. Inspire and empower people to do their best work.
Architects ultimately succeed or fail based on the quality of people on their team. Knowing how to inspire and empower these people, so that they can do their best work, is therefore one of the most important skills an architect needs in order to be successful.
You can’t do it all yourself. This can be frustrating at times, and at times you might think that you can (particularly in times of frustration). I’ve personally hacked together 1,000s of lines of code that I’m incredibly proud of in a weekend, and that would have taken weeks or months to get done if I had to instead explain the idea to somebody else and wait for them to write those same 1,000s of lines of code. And the 1,000s of lines they write of course wouldn’t end up being the same as the ones I’d have written. And they may decide that they don’t like the design after all, start discussing it with colleagues, stage a mutiny, and ultimately overthrow what once seemed like a great idea. This is a tough pill to swallow. But it’s a sad fact of life that you need to learn to be comfortable with.
The same thing would have happened if you were the one to implement the idea, of course; the difference is that somebody else needs to be empowered to take the kernel of an idea, and run with it. That entails reshaping it as necessary to make it realistic and successful.
I’m not suggesting architects don’t write code (quite the opposite: see #3 below), but you can’t write it all (except for very small projects). If you buy the argument that an architect is just the leading senior engineer on the project, then by definition the architect is probably qualified to write quality code quickly. But what about the code they don’t write? Other people on the team need to write it, and the architect needs to have enough time (where he or she isn’t hacking code) to inspire those people to write the right code. This takes energy and effort. You need to paint a compelling picture of the future, but with enough open-endedness such that the team can flex their creative muscles and fill in the details.
This is the only way to scale. And architects need to scale to achieve broad impact.
Architects should also welcome all ideas with open arms. You want to foster an open and energetic environment on your team, where intellectual debate is the norm. All ideas are fair game.
That’s not to say all ideas are good ones, and ultimately the bad ones need to die a quick and painless death before going too far, but an architect who won’t even entertain new ideas from the team (typically because of NIH, or Not Invented Here, syndrome) often drives away the best engineers. Great engineers hate to be told what to do. They don’t want to feel like they are walking in the shadows of somebody else. They want to use the skills that make them so great, which involves inventing bigger, badder, and more impactful designs. And you want them to use these skills too, because that’s why you hired them: these skills are crucial to the success of your project. Part of your role as the team’s architect is to recognize who on the team has the most potential, and to arrange for them to have as much leeway and creative freedom as possible. You don’t want to end up with a bunch of lackeys whose job is to “just implement” your ideas, because you’ll get what you paid for.
It’s a true sign of success when the culture you impart unto your team allows them to invent things in the spirit of your own design principles, but without you needing to do it yourself. Jim Gray, for example, inspired countless people to do great things. Does he get credit for each of those ideas? Of course not. But was he indirectly responsible for them to some degree, and do they all have a little Jim Gray in them? Absolutely. Being an architect on a team is similar; not every idea has to be your own. In fact, it’s far more powerful if few of them are.
1. Oversight, but not dictatorship.
That brings me to technical oversight. Because an architect is typically not a manager for his or her project (although in some cases he or she may be), arm’s-length influence needs to be used to get things done. In fact, the architect may have very little say over specific project management, scheduling, and budget decisions, but is typically on the senior leadership team for the project. So when I talk about “leeway” above, I’m talking about the degree to which an architect monitors and attempts to meddle with the progress of the team. While it’s tempting for an architect to set the ship sailing to sea, and then turn around to work on the next big thing, this almost never works. The initial vision and idea are far from a shipping solution, and software engineering only gets interesting once you actually try to build something. Ideas are cheap. The architect needs to help the team work through the ramifications of certain technical decisions that were made up front, and help with the continual course correction.
Because an architect’s butt is ultimately on the line, he or she needs to work as fast as possible to correct problems when something goes wrong. This implies the architect is involved enough to notice when something goes wrong, hopefully well in advance of anybody else seeing it. I’ve seen many models that work, ranging from the architect being the approver for all major design decisions, to the architect simply reviewing all major design decisions after the fact, to the architect delegating this responsibility to trusted advisers. For example, Linus Torvalds for the longest time required that all checkins to the Linux code base be reviewed by him. Anders Hejlsberg still effectively approves each C# language design change. In my opinion, the closer to each major decision the architect can afford to be, the better.
Left to its own devices, the team would veer off course in no time. That’s not because of malicious intent, but rather because of the sheer diversity of software engineers. This diversity is present on many levels: in skill level, taste (which is hard to measure: more on that in #2 below), motivation, work ethic, interpretation of the vision, personal beliefs and experience, and so on. An architect acts as a low-pass signal filter, smoothing out any irregularities that deviate too far from the core design principles.
In Tony Hoare’s ACM Turing Award paper of 1981, The Emperor’s Old Clothes, he explains the risk of not providing this kind of architectural oversight:
“’You know what went wrong?’ he shouted - he always shouted – ‘You let your programmers do things which you yourself do not understand.’ I stared in astonishment. He was obviously out of touch with present day realities. How could one person ever understand the whole of a modern software product like the Elliott 503 Mark II software system? I realized later that he was absolutely right; he had diagnosed the true cause of the problem and he had planted the seed of its later solution.”
Sadly, this responsibility often entails being “the bad guy”. Sometimes you need to mercilessly kill an idea because it would put certain parts of the project at risk. Other times you need to let somewhat bad (but not too impactful) ideas go. There’s a tradeoff here, because each time you kill an idea you’re going to leave somebody feeling burned. And you may waste people’s time, depending on how much time has already been invested in that idea. Some battles are best left unfought. There is an art to be learned here: if you can get those with the idea to firmly believe that there has to be a better way, you can avoid being seen as the bad guy. “Sit back and wait” can work in some cases, but it can backfire too.
The deep involvement in the technical design details unfortunately means that the architect can become the bottleneck if he or she is not careful. This can slow the team down. Some slowdown can admittedly be a good thing, because it has the effect of forcing more thoughtfulness in each and every decision. But as the team grows, the granularity of decision oversight necessarily has to change to ensure the team is empowered to make progress. In order for this to work, you need to have trusted individuals who are involved at a finer granularity and will use the same principles and values. This takes trust and time.
2. Taste is a hard thing to measure, but is invaluable.
Software engineers like to measure. Many people try to make design decisions based on quantitative data, even though they know that engineering is more of an art than a science. But there is one common trait that, as far as I can tell, is impossible to measure, and yet common to all of the great software architects I know: good taste. And because it’s impossible to measure, those who lack it have a hard time understanding the difference between a design with good taste and one with bad taste.
There is a certain elegance and beauty to the designs created by architects with good taste. When you see it from a distance, you know it, but when viewed under a microscope—the kind of microscope used when debating the finer points with other engineers on the team—it is much harder to detect. Often it’s incredibly difficult to articulate why some particular design has good taste, which makes it even harder to justify. Eventually people are willing to trust your judgment because they begin to see it too.
In fact, good taste is perhaps one of the most important skills an architect needs to have. Bad taste leads to clunky designs that nobody likes to use. Steve Jobs knows this. And yet taste is probably the most difficult skill for an architect to develop, and one of the subtler ones that few people recognize as being necessary. Many managers think that throwing more engineers at a design problem will solve it, when in reality often all that is necessary is one person with very good taste and an eye for detail.
I’m not certain where taste comes from: an innate skill? Perhaps, but not exclusively. In my best estimation, good taste can be learned from paying close attention to the right things, taking a step back and viewing designs from afar often enough, being learned in what kinds of software have been built and were successful in the past, and having a true love of the code. That last part sounds cheesy, but is true enough to reemphasize: if you don’t feel a certain passion for your code and project, it’s a lot easier to let bad taste run rampant, because your care level isn’t as intense as it needs to be.
3. Write code and get your hands dirty.
The best architects realize that code is king. It rules all else. At the end of the day, Visio diagrams, high level vision documents, whiteboard works of art, design documents, emails, functional specifications, and so on, are all a means to an end, not the end itself. The code is your product, and if you don’t understand the code, you don’t understand the state of the project. And if you don’t understand that, you’re not in a position to know what’s working well or what isn’t, and you can’t possibly have the deep understanding necessary to influence the engineers on the team. You’ve lost control.
The worst architects couldn’t code themselves out of a cardboard box. If you’re not writing actual product code, you’re not an architect: you’re an ivory tower has-been, and probably doing more damage than good. Do your team a favor and move into management as quickly as possible.
Writing code also has the benefit of ensuring that you maintain credibility with the team. It’s easy to dictate crazy and grandiose ideas, but if you’re the one who has to implement such a grandiose idea, you’re apt to be more sympathetic to and mindful of the other engineers on the team. You need to keep yourself grounded, and writing real product code will help to ensure your technical decision making carefully considers the implementability and down-to-earth ramifications of your decisions.
Moreover, you need to be a programming expert. People need to respect your abilities, and you want your team to look up to you. You want them to come and ask for your advice because they want it, and enjoy it, and not force them to deal with you simply because of your position on the team. All of the great architects I’ve worked with have inspired me to grow simply because they know so damn much, and because I learn something new every time I interact with them. If they didn’t write code and understand the nitty gritty technical esoterica, this relationship would have been a shallow one.
4. The power of the dyad: know your weaknesses.
Architects need to play a dual role in understanding both business and technical needs and strategy. The degree of business savviness varies greatly among architects, although the best architects I know have a unique ability to understand both sides of the coin. But at the end of the day, they are first and foremost technology wonks, and the business angle is more of a curious hobby. In music, two notes sounding together form a dyad, while three or more form a chord. The best architects I know realize their relative weakness on the business end of things and partner up with another senior leader with complementary skills, to fill in the gaps: this forms a harmonic interval. A dyad.
The partnership needn’t entirely be “business” vs. “technical”, although in commercial software that’s more often than not the two opposing forces. For example, my impression of the development of Scheme is that Guy Steele played the role of the architect while Gerald Sussman was the more business-oriented advisor, looking at how Scheme might be used to advance the broader research agenda but not necessarily meddling in the technical design details of the project.
If an architect is 80% technology and 20% business, partnering with somebody who is 20% technology and 80% business can be a killer combination. This allows you to bounce ideas off one another, and to get a certain level of objective feedback from a different perspective. If you’ve got a great technical idea, and bounce it off another techno-nerd, you might spend hours or days debating technical details that ultimately boil down to a matter of taste. But if you take that same idea and bounce it off your business partner, he or she is likely to provide more pertinent feedback: does it make sense from a business perspective, will customers need it, will it open up new product or revenue opportunities, are there more pressing matters to focus the team on, etc. These are things that, to a technology guy (or gal), wouldn’t immediately come to mind. But remember: it’s all about the customer.
5. It's for the customer, not you.
The best engineers often succeed because they focus on scratching a personal itch. That’s what Linus Torvalds, Bjarne Stroustrup, and countless others did. This is why Donald Knuth created TeX. The idea for a new technology thus begins as a very personal and selfish act. “Build something you’d use yourself, and the customers will come” is a common (cliché) idiom. Although there is certainly truth to this, it’s true only because the very fact that it is bothersome to the founding engineer is likely indicative that it’s bothersome to a broader set of people. It’s an example, where an example is just one element in a set that is used to demonstrate some common attribute among all elements in that set. Those people are your customers.
As a technology matures, it’s important to realize—particularly when building commercial software—that actual human beings will want to use the technology. It’s important to understand and respect their needs. It’s important to, at some point, realize that you’re not, in fact, building a system entirely for your own personal use. Not realizing this point can blind you and make you neglect the need to partner with somebody who understands the business angle of things. It can also lead to a feeling of needing to develop the perfect idealized solution and never ship to customers. Hey, when there are endless technical problems to work on, who would want to ship anyway? By its very definition, shipping software means that you’ve solved all of the major technical problems within a certain scope. What fun is that?
The fun is that you’re able to make an impact on your customers’ lives, hopefully for the better. Your initial technical vision has come to fruition, and you can move on. You get to prove your ideas by having real human beings use the end product. If you never get to that state, then you’ve done some possibly interesting research (which is hopefully documented and used by somebody someday in the future to actually impact people by delivering a system based on those ideas), but you haven’t architected a product. You’re a researcher, not an architect.
6. Admit when you're wrong, fall on your sword, and then fix it.
You are going to be wrong sometimes. Trying to do big and bold things necessarily involves some risk. Being an architect requires a careful balance between sticking to your guns—your guiding principles and technical vision—and realizing when things aren’t working out and course correcting before it’s too late. It’s hard to tell when things are beginning to go off course, but when they’ve already gone off course it’s usually obvious. A common telltale sign that things are in trouble is when the team no longer believes in the vision. This may translate into attrition (often of your best engineers first), or just hallway grumblings. Listen carefully. If you’re not involved in the design decisions, writing code, and actually playing a significant role in your team’s daily lives, then you’re apt to miss this. As the architect, you are responsible for responding as quickly as possible to such situations before the shit hits the fan.
Some architects can fall into the trap of using dogma over intellect. Firm principles are of course something I’ve stressed throughout this article. But you need to be honest with yourself and admit when things are not going well. An architect who stands at the helm of a sinking ship, proclaiming that the ship stay its course because the brave new world lies ahead, will only drown (alone) when the ship finally goes underwater. Although this architect can then go around blaming his team for the failure (“if they had only seen the vision and stuck around, we would have succeeded”), the project will be long gone by then. It’s harder, but more noble, to recognize the problems proactively and do your best to fix them.
For example, Tony Hoare describes, in the same ACM Turing Award paper mentioned above, how he felt responsible for the failure of the Elliott 503 Mark II project:
“There was no escape: The entire Elliott 503 Mark II software project had to be abandoned, and with it, over thirty man-years of programming effort, equivalent to nearly one man’s active working life, and I was responsible, both as designer and as manager, for wasting it.”
It can be particularly disturbing to realize that a large number of people have been going off in the wrong direction on your watch. Yes, you wasted their time. But you have to learn what went wrong, internalize it, and commit to never making the same mistake twice. You owe it to them to respond promptly. Everybody on the team will have learned and grown from the circumstances, and if you’re lucky the situation is salvageable. Sometimes it won’t be. But in any case you will gain the respect of many around you by making the right decision; particularly if you’re the only person with the broad technical responsibility, understanding, and insight necessary to make such a decision, people will feel relieved when you make it. And if you don’t make it, people will curse you for it.
In conclusion
I’m sure there are many other laundry lists of skills people might come up with that are necessary to be an effective architect, but these are a few of the things I see and respect in the people I look up to. I’ve named some of these people throughout this article. The most common trait is that they have done great things and left their mark on the industry. Being an architect, in the end, is all about helping others to succeed. If you’re a really good architect, you’ll inspire people and rub off on them. You’ll gain a certain level of respect that is unmistakable and priceless. And that, in my opinion, is far more fulfilling than anything you could accomplish on your own working in a vacuum.
 Wednesday, October 01, 2008
The October 2008 MSDN Magazine issue just went live with 5 articles on concurrency, plus the editor's note. Four of the articles are written by members of the Parallel Computing team here at Microsoft, including one by me:
Enjoy the text, and be careful not to overdose on the excess of parallelism goodness. This edition was timed intentionally to coincide with the PDC. I'm hoping to see you there, because we have a plethora of exciting things to show, spanning managed .NET and native C++ programming. These articles are really just teasers.
|