Fran Allen delivers Organick Lecture (click to enlarge) |
Fran Allen was the Turing
Award winner for 2006. This afternoon she's giving the
University of Utah's Organick Memorial
Lecture. I've reported on some of these in the past few years:
I try to come every year. I find it's something I'm inspired by each
time.
The grand goal of high performance computers right now is a 1
petaflop machine. This requires 1,000,000 gigaflop processors.
Wow. She shows a semilog plot of peak speed vs year introduced that
is a linear line (Moore's law at work).
Much of Allen's work in the 80's and early 90's was around the PTRAN
system of analysis for parallelism. The techniques are used, for
example in the optimization stage of IBM's XL family of compilers.
Because more and more transistors are being placed on chips, they're
using more and more energy--getting hotter. Part of the
solution--which we're seeing play out--is multi-core chips. This
requires parallelism to achieve the performance users expect. But
making use of multi-codes requires that tasks be organized by either
users or software to run in parallel.
By 2021, there will be chips with 1024 cores on them. Is parallelism
the tool that will make al these ores useful? John Hennessey has
called it the biggest challenge Computer Science has every faced. He
has credentials that might make you believe him. Allen says that
it's also the best opportunity that Computer Science has to improve
user productivity, application performance and system integrity.
For parallel (superscalar, etc.) architectures,
compilers--software--have been used to automatically manage
scheduling tasks so that they can operate in parallel. What about
those techniques will be useful in this new world of multi-cores?
Allen says we need to get rid of C--soon. C, as a language, doesn't
provide enough information to the compiler for it to figure out
interdependencies--making it hard to parallelize. Another way to
look at it is that pointers allow programmers to build programs that
can't be easily analyzed to find out which parts of the program can
be executed at the same time.
Another factor that makes parallelization hard is data movement.
Allen offers no silver bullet. The latency of data movement inhibits
high performance.
The key is the right high level language that can effectively take
advantage of the many good scheduling and pipelining algorithms that
exist. If we don't start with the right high level language, those
techniques will have limited impact.
She presents some research from Vivek
Sarkar on compiling for parallelism. Only a small fraction of
application developers are experts in parallelism. Expecting them to
become such is unreasonable. The software is too complex and the
primary bottleneck in the usage of parallel systems. X10 is an
example of a language (object oriented) that tries to maximize the
amount of automatic parallel optimization that can be done.
Major themes include cross-procedure parallelization, data dependency
analysis, control dependency analysis, and then using those analyses
to satisfy the dependencies while maximizing parallelism.
Useful parallelism depends on the run time behavior of the program
(i.e. loop frequency, branch prediction, and node run times) and the
parameters of the target multiprocessor. Finding the maximal
parallelism isn't enough because it probably can't be efficiently
mapped on the multiple cores or processors. There is a trade off
between the partition cost and the run time. Finding the
intersection gives the right level of parallelism--the level
that is the most efficient use of available resources.
Interprocedural analysis is the key to whole program parallelism.
One of the PTRAN analysis techniques was the transform the program
into a functional equivalent that used static single assignment.
This, of course, is what functional programming enthusiasts have been
saying for years: one of functional programming's biggest advantages
is that functional programs--those without mutation--are much more
easily parallelized than imperative programs (including
imperative-based object oriented languages).
There's a long list of transformations that can be done--everything
from array padding to get easily handled dimensions to loop unrolling
and interleaving. Doing most of these transformations well requires
detailed knowledge of the machine--making it a better job for
compilers than humans. Even then, the speedup is less than the
number of additional processors applied o the job. That is, applying 4
processors doesn't get you a speedup of 4--more like 2.2. The speed
up--at present--is asymptotic.
Tags:
events
utah
parallel+computing
programming+languages
functional+programming