Map, Everything's An Object, and Inline

One of the reasons that functional programming is worth studying is that it abounds with opportunities for implicit parallelization. As Jon Harrop discusses in this post, the map function takes a function object f and an array [a, b, c, d] and returns [f(a), f(b), f(c), f(d)] (syntax varies from language to language, of course, but you get the point).
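For concreteness, here is a minimal sequential map sketched in C++ (chosen because the inline discussion below is C/C++ territory); the function name and the use of std::vector are purely illustrative, not anything from Harrop's post:

    #include <vector>

    // A minimal sequential map: given f and {a, b, c, d}, produce
    // {f(a), f(b), f(c), f(d)}, preserving order.
    template <typename T, typename F>
    auto map(const std::vector<T>& xs, F f) {
        std::vector<decltype(f(xs[0]))> out;
        out.reserve(xs.size());
        for (const auto& x : xs)
            out.push_back(f(x));
        return out;
    }

    // Usage: yields {1, 4, 9, 16}
    // auto squares = map(std::vector<int>{1, 2, 3, 4}, [](int x) { return x * x; });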

The optimist sees this and says "Ah hah! The compiler can simply distribute these calculations to a thread-pool and have a performance advantage on a manycore machine." And this is true if (a) f is expensive to compute or (b) the array is quite large. Otherwise, the overhead of distributing the calculation across cores / processors can easily exceed the cost of performing the map "in core." In the worst case, when the function and data are already in the initial core's cache, the performance hit for distributing the work can be very substantial.
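To make that tradeoff concrete, here is a rough sketch of a map that only distributes work when the input is large; the 10,000-element cutoff is an arbitrary placeholder (a real implementation would tune it, and would account for how expensive f is, not just how many elements there are):

    #include <cstddef>
    #include <future>
    #include <vector>

    template <typename T, typename F>
    auto parallel_map(const std::vector<T>& xs, F f, std::size_t cutoff = 10000) {
        std::vector<decltype(f(xs[0]))> out(xs.size());

        // Small inputs: stay "in core." The data may already be in this
        // core's cache, and thread start-up would swamp the real work.
        if (xs.size() < cutoff) {
            for (std::size_t i = 0; i < xs.size(); ++i)
                out[i] = f(xs[i]);
            return out;
        }

        // Large inputs: split the range in half and compute the back half
        // on another thread. A real implementation would use a thread pool
        // and split into more pieces than two.
        const std::size_t mid = xs.size() / 2;
        auto back_half = std::async(std::launch::async, [&] {
            for (std::size_t i = mid; i < xs.size(); ++i)
                out[i] = f(xs[i]);
        });
        for (std::size_t i = 0; i < mid; ++i)
            out[i] = f(xs[i]);
        back_half.wait();
        return out;
    }

Even this toy version shows where the complexity goes: somebody -- the programmer, the library, or the compiler -- has to guess where the crossover point is.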

This is a familiar theme in programming languages: a theoretical capability runs afoul of implementation realities. The best design decision in Java was "(Almost) Everything's an object": numbers and strings -- the most commonly used data types -- get special treatment. Numeric primitives have what the .NET world calls "value semantics" rather than being full objects, and strings, though objects, are immutable and get compiler-level support. Those compromises were made largely for performance reasons (immutable strings are also good for a couple of other reasons). To this day, you can feel an occasional performance hit with pure object-oriented languages (before you flame, keep in mind that I'm about to deploy a Ruby-based service right into the middle of a live multimillion-dollar application, and Ruby is not only purely object-oriented, it's interpreted. So I do understand that performance, productivity, and responsiveness are different things).

Another example is C/C++'s inline keyword. Structured programming theory tells us not to repeat ourselves -- to define functions rather than writing the same code in multiple places. But in the not-terribly-distant past, the cost of a function call was large relative to local operations (a situation that still holds today on some embedded processors). So C and C++ have an inline keyword to say "don't generate a function call for this; generate the code in place." But unlike Java's success with "(Almost) Everything's an object," the inline keyword turned out to be pretty much a disaster. To this day, some people doing embedded systems undoubtedly use inline to great effect, but most developers do a poor job of estimating its benefit, because, just as distributing map can be counter-productive, inlined code can decrease performance: the on-chip caches of modern processors make code size and data locality very important. (And don't even get me started on template metaprogramming.)
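For readers who haven't met it, here is roughly what the keyword looks like in use; the function names are invented for illustration:

    // "inline" asks the compiler to substitute the function body at each
    // call site instead of emitting a call. It is only a hint: modern
    // compilers are free to ignore it and largely make their own inlining
    // decisions based on the size and frequency of the call.
    inline int squared_distance(int dx, int dy) {
        return dx * dx + dy * dy;
    }

    int path_cost(int dx, int dy) {
        // If the compiler honors the hint, this call compiles down to the
        // multiplies and adds themselves, with no call/return overhead --
        // but the body is duplicated at every call site, growing the code
        // and putting pressure on the instruction cache.
        return squared_distance(dx, dy) + 1;
    }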

So, what does this history suggest for the manycore era?

  • Languages that promise that the solution is "every call's distributed" (or some other "pure" approach) will either fail outright to deliver performance gains or will require very sophisticated just-in-time code generators (this is similar to the situation with dynamic languages such as LISP and Smalltalk, where commercial VMs are leaps and bounds beyond academic "proof of concept" interpreters / VMs). The problem with this is that the development of sophisticated JITters requires time and experience, so "the language takes care of it" solutions face a very big chicken-and-egg problem.
  • Languages that shift 100% of the burden of parallelization to the programmer (a ParallelAttribute that can be applied to blocks or functions, say) will work in the hands of experts but will be disasters in the hands of the mainstream.
  • Some kind of hybrid approach that purists decry as tainted but that solves 80% of the problem (a la Java's "(Almost) Everything's an object") will be the winner.

(No, I don't know what the hybrid approach will be.)