I often speak of the need to develop programming models whereby developers write code in the most natural way possible, and it just so happens to be inherently parallel. I don’t believe the lion’s share of developers want to rearchitect and rewrite their code with parallelism at the forefront of their development process. They don’t want to think about shipping memory over to the GPU and launching a highly specialized data parallel kernel of computation, nor do they want to have to add locks and transactions everywhere to make things safe. I do believe, however, that the lion’s share of developers wouldn’t mind if their code ran faster as hardware got faster (via more cores).
(To be clear, there are certainly a lot of developers who will be happy to write specialized code if it means eking out every last bit of performance on their machine. But that’s the minority.)
This viewpoint tends to get a lot of skeptical looks from people who quickly point out that this has been tried countless times before, and always leads to failure. They are, of course, referring back to the ’80s and early ’90s, when “dusty deck” parallelization was all the rage, mostly in the realm of vectorization and HPC. To be fair, there were some mild successes in getting floating-point loops parallelized, but it’s no wonder these attempts had little longevity. No-touch solutions are always inadequate. Trying to make a fundamentally non-secure program secure by way of, say, virtualization may work in some constrained circumstances. But the right solution is to develop models and practices that lead to security-by-construction.
Furthermore, languages were (and in most cases still are) lacking some major prerequisites for large-scale automatic parallelization:
- Safety. Unless a compiler can reason about the determinism of a program when run in parallel, it cannot prove that the program’s results will remain correct once parallelism is added. Compilers are therefore limited to parallelizing highly specialized, easily recognized patterns, like loops consisting solely of floating-point additions of two arrays indexed by the loop iteration variable.
- Performance. Rampantly parallelizing a huge program wherever possible is dangerous: it can drain performance and make power consumption skyrocket. Dynamic techniques like work stealing and static techniques like nested data parallelism and profile-guided feedback need to work together to inform these decisions. Machine-wide resource management needs to know about the memory topology, machine load, and policy, and make informed decisions based on them. Although there has been a lot of disparate research in these areas over the years, these threads have only recently begun to come together. Certainly in the ’80s, they were in their infancy.
- Declarative patterns. Most of the prior art was done in FORTRAN, a standard imperative language riddled with loops, effects, and assignments. Programs need to be written with as few dependencies as possible in order to expose large amounts of parallelism, and the von Neumann-inspired languages fall short of this aim. Data comprehensions allow set-at-a-time computations to be expressed in a higher-order way, and newer languages like Fortress have semantics that permit parallel evaluation in many more areas, like argument evaluation. And application models that encourage isolation and loose coupling of state allow for coarse-grained partitioning. (There’s a small sketch of what this style looks like just after this list.)
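To make that concrete, here is a minimal sketch in Haskell (GHC plus the parallel package’s Control.Parallel.Strategies) of the very loop those old compilers hunted for: element-wise addition of two floating-point arrays, written declaratively. The names addArrays and addArraysPar and the chunk size of 1024 are illustrative choices of mine, not anything prescribed; the point is that purity gives determinism by construction, so the parallel version is just the sequential expression plus an evaluation strategy.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Pure, element-wise addition: no effects and no loop-carried dependence,
-- so evaluating it in parallel cannot change the answer.
addArrays :: [Double] -> [Double] -> [Double]
addArrays = zipWith (+)

-- The same computation with an evaluation strategy bolted on: chunks of the
-- result are sparked in parallel and the GHC runtime load-balances them
-- across cores. The chunk size (1024 here) is a granularity hint, no more.
addArraysPar :: [Double] -> [Double] -> [Double]
addArraysPar xs ys = zipWith (+) xs ys `using` parListChunk 1024 rdeepseq

main :: IO ()
main = print (sum (addArraysPar [1 .. 100000] [1 .. 100000]))
```

Compile with -threaded and run with +RTS -N and the runtime will use whatever cores it finds; nothing about the program itself has to change.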
In addition to those three things, we must have realistic expectations. Even if a program were fully safe to parallelize, as many Haskell kernels are, we would seldom see perfect scaling. Buying a 128-core machine doesn’t necessarily give you a 128x speedup. Why? Because there are still portions of the code that will end up less parallel than others, and some parts may even continue to run sequentially. There will always be I/O and waiting: these are real programs, after all, and real programs tend not to be 100% computation. They need to do something useful with the real world. Moreover, safety does not mean “dependence free,” and data dependencies are ultimately what limit parallelism.
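Amdahl’s law puts rough numbers on this. Here is a tiny back-of-the-envelope calculation; the 5% serial fraction is a number I made up purely for illustration:

```haskell
-- Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the fraction of
-- the program that stays sequential and n is the number of cores.
amdahl :: Double -> Double -> Double
amdahl serialFraction cores =
  1 / (serialFraction + (1 - serialFraction) / cores)

main :: IO ()
main = mapM_ (print . amdahl 0.05) [2, 8, 32, 128]
-- Roughly 1.9x, 5.9x, 12.5x, and 17.4x: a 95%-parallel program tops out
-- well short of 128x on 128 cores.
```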
My stated goal, therefore, is that parallelism in programs be limited solely by data dependence. Safety issues are handled by construction. Performance is (mostly) handled by the system, although, as with all things, some amount of measurement, hinting, and tuning will be necessary. But hopefully a huge part of tuning performance will be seeking out needless dependencies, or finding new algorithms that have different dependence characteristics. And with that, we can focus our energy on raising the level of abstraction and pushing more declarative patterns that are broadly useful. Over time, as more and more programs are written in this fashion, they will become more and more naturally parallel.
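What does “finding new algorithms that have different dependence characteristics” look like in practice? Here is one small, hedged sketch in Haskell: a left-fold sum threads an accumulator through every element, so as written it is one long dependence chain, whereas re-expressing the sum as a balanced divide-and-conquer (legal because integer addition is associative) leaves the two halves independent and therefore free to run in parallel. The names sumChain and sumTree are mine, and a real version would cut off the recursion at some granularity rather than sparking tiny sublists.

```haskell
import Control.Parallel (par, pseq)

-- Sequential by construction: every step of the fold depends on the previous
-- accumulator, so there is nothing here for a parallel runtime to exploit.
sumChain :: [Int] -> Int
sumChain = foldl (+) 0

-- Different algorithm, different dependence structure: the two halves are
-- independent, so one half can be sparked while the other is evaluated.
sumTree :: [Int] -> Int
sumTree []  = 0
sumTree [x] = x
sumTree xs  = left `par` (right `pseq` (left + right))
  where
    (l, r) = splitAt (length xs `div` 2) xs
    left   = sumTree l
    right  = sumTree r

main :: IO ()
main = print (sumChain [1 .. 10000], sumTree [1 .. 10000])  -- both 50005000
```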
What do you think? Am I crazy? Perhaps. But I still know we can do it.