By now you’ve probably read things like Herb Sutter’s free lunch paper. And if you follow my blog at all, you’ll know that I do a bit of writing and thinking about how Microsoft can make our platform better suited for the multi-core era that stands in front of us.
Most people, when considering the topic of parallelism vis-à-vis multi-core, start by jumping straight to the bottom of the stack. I’ll admit that I sure did. They think about threads, locks, and the associated headaches. Some even think about the chip architecture and memory hierarchy. They take it for granted that the work exists. But these same people seldom stop to think about what workloads will actually benefit substantially from massive amounts of parallelism, and when they do, they often hit the same wall. This is a difficult topic.
Scientific computing of course has this nailed pretty well already. But how much of the code you write actually resembles scientific problems, like n-bodies, heat transfer, fluid dynamics, and the like? My guess is that, for most of Microsoft’s customers, the answer is: Not much. That’s especially true on the client, where data-intensive operations are often shipped to a high-end server for processing, leaving what amounts to quasi-workflow orchestration initiated by UI events, for example. I’m not going to dispute the massive gains in CPU scalability we’ve seen over the past 10 years due to superscalar execution, via techniques like pipelining and branch prediction, and the effect that has had on client and server programs alike. But for most application code today, the network and disk are the limiting factors, not the CPU.
Of course, to the extent that there is work the CPU must perform for any problem, even an IO-bound one, code needs to be architected to separate logical tasks, ensuring that a bunch of otherwise ready-to-run work doesn’t get backed up behind a blocking call unnecessarily. And of course, separating logical work is important for other reasons, like avoiding a hung UI thread. Unfortunately, we don’t make this overly easy today. Neither the Win32 and WinFX APIs nor the associated documentation and tool support are much help when it comes to figuring out the performance characteristics of the code they invoke, including latency and blocking. This makes it tricky to architect things as I suggest. New programming models like the CCR provide the infrastructure that could facilitate such a shift, but it will take hard work to get to a reasonable place.
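To make the idea of separating logical tasks concrete, here is a minimal sketch. It uses Python's standard thread pool purely for illustration, not Win32, WinFX, or the CCR. The names slow_fetch, cpu_step, and run_separated are hypothetical, and the sleep is an assumed stand-in for a blocking network or disk call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(key):
    """Hypothetical stand-in for a blocking network or disk call."""
    time.sleep(0.2)  # simulated latency; an assumption, not a measurement
    return f"payload for {key}"


def cpu_step(n):
    """Hypothetical stand-in for ready-to-run work that needs no IO."""
    return sum(i * i for i in range(n))


def run_separated():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start the blocking call as its own logical task...
        pending = pool.submit(slow_fetch, "report")
        # ...so independent work proceeds instead of queuing behind it.
        local = [cpu_step(n) for n in (10, 100, 1000)]
        # Wait only at the point where the fetched result is actually needed.
        return local, pending.result()
```

The design choice worth noticing is where the wait happens: the caller blocks at pending.result(), not at the point where the fetch is issued, so the CPU-side work overlaps with the simulated latency.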
Back to workloads. Consider server applications for a moment. The model of concurrency there is actually quite simple. And in fact I believe the majority of server programs will be equipped to exploit multi-core right away. Each incoming request is considered a logical task and is assigned to an available worker thread, often using the CLR’s thread-pool. Sharing between concurrent requests is (hopefully) minimal, meaning that the one-thread-per-request model leads to naturally good scaling. This works up to a point. Once the average number of available CPUs surpasses the average number of incoming requests, the need to assign multiple CPUs to a single request becomes more important. This is obviously very workload dependent. Databases already do this with individual queries. They use a single-thread-per-request model, but often parallelize individual queries to get better utilization. SQL Server added support for this in 7.0. I’ve been working quite a bit over the past year on similar techniques for LINQ. I’m almost to the point where I can disclose more information publicly, in the form of a paper.
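As a rough illustration of assigning several workers to a single request, the sketch below partitions one query (a filter followed by a sum), runs the partitions concurrently, and merges the partial results. This is not how SQL Server or LINQ implements it; the function names are hypothetical. Note also that in standard CPython the global interpreter lock keeps threads from executing Python bytecode in parallel, so this shows the shape of the technique rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Per-partition work: filter to even values, then aggregate."""
    return sum(x for x in chunk if x % 2 == 0)


def parallel_even_sum(items, workers=4):
    """Run one logical query across several workers and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partition(items, workers)))
```

For example, parallel_even_sum(list(range(1, 101))) returns 2550, the same answer a sequential loop would give. The merge step is trivial here only because addition is associative; other operators need more careful merging.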
Search is clearly a workload of recent importance that, whether on the client or server, benefits tremendously from parallel execution. This applies not only to the act of searching, but also to the act of indexing the data in preparation for search. MSN and Google’s current desktop search products take care not to interrupt your primary work, indexing only while your computer is idle. But given a bunch more cores, they needn’t wait. Further, parallelizing search is a well-researched topic. You still need to solve some tough problems like ensuring parallel tasks aren’t contending heavily for the disk (becoming IO bound), but it’s very possible.
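A toy version of parallel indexing might look like the following sketch: documents are split into shards, each shard builds its own inverted index concurrently, and the partial indexes are merged. The in-memory dictionary of documents, the whitespace tokenizer, and all function names are assumptions for illustration. Real desktop search products are far more involved, and this sketch ignores the disk contention problem mentioned above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_shard(shard):
    """Build an inverted index (word -> set of doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, text in shard:
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def build_index(docs, shards=2):
    """Index each shard concurrently, then merge the partial indexes."""
    items = sorted(docs.items())
    # Round-robin assignment of documents to shards.
    pieces = [items[i::shards] for i in range(shards)]
    merged = defaultdict(set)
    with ThreadPoolExecutor(max_workers=shards) as pool:
        for partial in pool.map(index_shard, pieces):
            for word, ids in partial.items():
                merged[word] |= ids
    return merged


def search(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), set()))
```

Because each shard writes only to its own private index, the workers share nothing until the merge, which is what lets the indexing step scale with the number of shards.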
There are of course other workloads. Graphics processing on modern computers is extremely parallel, currently handled by the GPU. But I am going to wrap up, and summarize all of this by saying: It remains to be seen whether most mainstream Windows programs can become highly parallel, and if they can, how profitable it will be. We’ll also find out whether reaching that stage will require radically new programming models and a gradual shift over time. I am optimistic, and confident that parallel execution is the direction we ultimately need to go. Surely the workloads are there, seemingly obscured by the traditional sequential approach to software.