From: Lance Hammond [lance@leland.stanford.edu]
Sent: Thursday, October 04, 2001 1:31 AM
To: Ian Buck; Ujval Kapasi; Bill Mark
Subject: Streaming from the Viewpoint of DSP-C

Guys:

As requested, here are my opinions about some of the features that a streaming language needs to have and some of the problems that will have to be overcome in the development of such a language. If you've already read my writings on DSP-C, a lot of the following won't be particularly new and earthshaking. If not, then you might want to look them up and read/skim them (they are now available at http://www-lance.stanford.edu/Public/ ).

I basically believe that we should make a language that allows pretty low-level access to individual operations performed within the processors, much like C does today (a sort of "high-level assembly" that is *very* portable from architecture to architecture). On the other hand, in order to take advantage of streaming-style memory systems, the language needs to be *somewhat* more explicit about the memory model, an area which is completely abstracted away in most varieties of C, apart from the "register" specifier for variables. Compilation to modern streaming-style memory hierarchies is just too complex for such a simple model. A fair amount of abstraction is still required, however, in order to ease programming and prevent us from becoming too hardware-specific. I am a little religious on this point because I feel that any language we design should be fairly architecture neutral, except for some occasional "pragma"-type option flags that may be scattered throughout the code. Otherwise, we risk becoming just a programming language-of-the-month attached to an architecture project or two, and that's not something I'm interested in.

Of course, the more abstract and high-level the language is, the more the compiler must be able to analyze in order to generate efficient code. I'm personally no big fan of compilers, but I do appreciate the fact that they allow me to run the same code on an x86, a MIPS machine, and sometimes even my Macintosh. As a result, my main goal is to create a language that has the *minimum* amount of extra stuff necessary, beyond a baseline of ANSI C, to allow a relatively simple compiler to generate fairly good code on a streaming-type machine (or a DSP processor, a normal processor with multimedia extensions, etc.).

In the remainder of this email, I've tried to answer the questions Ian mentioned, although I've reordered them somewhat.

> Hardware Model:
> What are the underlying assumptions about the hardware model?

In the design of DSP-C, I settled on an abstract model consisting of one or more processors (which I have not defined in detail -- I regard most internal processor differences as something that should NOT be reflected in the language itself) connected to a three-level abstract memory system, which I view as the crucial portion of the various machines that must be abstracted away:

Processor registers: This very small chunk of "memory" attached to each processor is mostly used to hold temporary values for a few cycles, facilitate software pipelining, etc. Small variables or parts of variables (such as a few values stripmined out of an array) may also be register allocated, when possible. Except for the array stripmining, which could have a significant effect on an architecture with a lot of registers like Imagine, this level is treated by a compiler just like the registers in conventional processors today.
Local compiler-controlled memory: These on-chip memories, each associated with one or more processors, are an addition to the "normal" compiler view of the world. From a compiler perspective, these small memories (memory mats in Smart Memories, the SRF in Imagine, etc.) are treated as big register files that can contain arrays -- or very large stripmined-out sections of arrays -- instead of individual values. On most machines these local memories may be randomly accessed, but in the case of Imagine they may not. In effect, they are a software-controlled cache (for arrays only, in the Imagine case). The compiler is then responsible for doing register-file-style allocation within this local memory (as the Imagine project has already done) and for scheduling "loads" and "stores" to and from this memory (which the Imagine folks have done by hand up to this point).

Main memory: Large, off-chip memories are abstracted together into a single layer that just looks like an "infinite" sized memory holding all of the data needed by the program. This is just like the conventional compiler view of main memory, except that it is accessed primarily by special array/vector/stream accesses that will probably be handled with DMA-style hardware. For very large systems, this layer could be further split up into "local main memory" and "remote main memory," with the compiler trying to allocate data to minimize remote memory use as much as possible, using techniques like those used in compilers from the 1990s such as SUIF.

Interprocessor communication can be abstracted (from the compiler's point of view, at least) as reads and writes to "shared memory." In different systems, this "shared memory" could fall into any of the three different layers, and will require varying amounts of code generation. Explicit inter-cluster communication in Imagine looks essentially like a set of "shared registers," while processors that share a local and/or main memory bank may obviously pass data from one to another via loads and stores to and from the actual shared memory.

> MIMD/SIMD may not be that important but what should the programmer think
> about when writing optimal code with the language?

For extraction of basic data-parallel/streamable code segments, we generally should not care whether the individual processing units are MIMD, SIMD, etc. The only difference is in the type and amount of synchronization code that must be inserted by the compiler to keep everything working smoothly. As far as the programmer is concerned, they just need to write code with lots of very data-parallel loops -- and be able to make this parallelism clear. The compiler should then be able to analyze their code in a simple manner, detect a lack of dependencies, and split the loops into parallel segments. We just need to make sure that the language has enough hooks to indicate to a fairly simple compiler where the parallel loop bodies/kernels are located. While a complicated compiler could probably perform most or all of the necessary analysis itself, I think that some parallelism hints buried in the language will probably be necessary in the end.

On the other hand, if we take advantage of parallelism by allocating different tasks to the different processors, such as different "pipeline stages" of an algorithm, then we must clearly have MIMD machines. We may also need more explicit programming-language specification for these portions of code. More on that below, however.
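To make the data-parallel loop style (and the three-level memory view above) concrete, here's a rough sketch in plain ANSI C of the kind of code I have in mind. The BLOCK size, the local_in/local_out buffers, and the hand-written transfers are just illustrative stand-ins for whatever a real compiler would choose and generate for a particular local memory -- none of this is meant as actual DSP-C syntax:

    #define N     65536
    #define BLOCK 1024               /* hypothetical tile size, chosen per machine */

    float in[N], out[N];             /* live in "main memory"                      */

    void scale_and_bias(float gain, float bias)
    {
        float local_in[BLOCK];       /* the compiler would place these two buffers */
        float local_out[BLOCK];      /* in local memory (SRF, memory mat, etc.)    */
        int i, j;

        for (i = 0; i < N; i += BLOCK) {
            /* "load": a bulk, DMA-style transfer from main memory */
            for (j = 0; j < BLOCK; j++)
                local_in[j] = in[i + j];

            /* the kernel: no cross-iteration dependencies, so it can be    */
            /* split across SIMD clusters or MIMD processors as the target  */
            /* machine prefers, with individual values register allocated   */
            for (j = 0; j < BLOCK; j++)
                local_out[j] = gain * local_in[j] + bias;

            /* "store": a bulk transfer back out to main memory */
            for (j = 0; j < BLOCK; j++)
                out[i + j] = local_out[j];
        }
    }

Ideally the programmer writes only the kernel loop over the whole arrays and the compiler synthesizes the stripmining loop and the transfers itself; the point is simply that nothing here requires more than dependence and liveness analysis over very ordinary C.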
> Compiler:
> What should the compiler be responsible for?

I personally feel that it should be responsible for as much as it possibly can be, in order to allow code written in our language to be as portable as possible. It's our goal to figure out what it is reasonable for the compiler to deduce on its own, and to put in just enough language features to reach that point. Put another way, the real question is: what must the programmer explicitly encode in an application in order to make it easy to compile?

> Stream memory management

I feel that most memory handling among the abstract layers should be fairly automatic. Compilers today can handle register allocation, and the management of the multilayered memory in a streaming machine (as described above) is largely just a larger-scale exercise in register allocation, since the compiler is now also responsible for allocation of an additional layer of local memory. Because local, on-chip memories come in all shapes and sizes, leaving this task to the compiler is probably the only way to get reasonably optimal object code on multiple machines from the same source code, since the optimal degree of data blocking and stripmining will no doubt vary with memory size and architecture. These techniques have been used by compilers for years to divide up code into vectors and chunks for parallel processors, so I don't think compilers will have many problems, as long as we do simple things like eliminating C pointer aliasing effects. Of course, putting in a few helpful hints (similar to the "register" notation in ANSI C) can still be allowed in order to give the programmer more control over this aspect of compilation.

> Recognize temporary intermediate streams

This should not be too difficult for a compiler, since it is mostly a case of variable liveness analysis writ large. If a stream -- or a portion of a stream -- isn't referred to in the "future," before it is deallocated or goes out of scope, then it's clearly temporary and does not need to be "stored" to any levels of memory beyond those necessary to hold it during the time it is "live."

> Kernel collapsing (by this, I'm assuming Ian means combining small kernels)
> Blocking of computation (and this is "splitting" big kernels)

A compiler should be able to handle this, by and large. Most decisions about how to divide up kernels will simply be made by adding up how much memory is needed, and then merging or splitting the "kernels" (inner loop bodies in source code) to fit the available registers and/or local memory. We will have to do some compiler work to figure out exactly what the limits of compiler analysis are, but since the problem is very similar to blocking and strip mining on vector machines, I don't think that it will be a serious impediment. The only "funny" part that I can see here is that several different memory use patterns may appear fairly "optimal," especially if a compiler is designed to try tricks like loop fusion and/or loop nest reordering to generate different memory access patterns. However, a compiler could simply generate all of the plausible cases, execute them, and use profiling to pick the best one.
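As a tiny illustration of the merging case (ordinary C again, with made-up names): suppose the programmer writes two small kernels, one producing a stream tmp and one consuming it. If liveness analysis shows that tmp is used nowhere else, the two loop bodies can be fused, and tmp collapses from an entire intermediate stream down to a single register value:

    /* As written: two separate kernels with a temporary stream between them.  */
    /*     for (i = 0; i < n; i++) tmp[i] = a[i] * b[i];   -- kernel 1         */
    /*     for (i = 0; i < n; i++) c[i] = tmp[i] + d[i];   -- kernel 2         */
    /* After collapsing the two kernels into one:                              */

    void fused(const float *a, const float *b, const float *d,
               float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            float tmp = a[i] * b[i];   /* formerly a whole stream in memory */
            c[i] = tmp + d[i];         /* now just a register-held scalar   */
        }
    }

Splitting an oversized kernel is the same bookkeeping run in reverse: add up the registers and local memory each candidate loop body needs, and stop merging (or start splitting) when the total no longer fits.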
> What can a kernel routine do? Function calls, Conditionals, etc.

This is a tough question. While all of the memory allocation issues are independent of the type of parallel processor architecture used, these aspects are not. Let's just look at the extremes:

1) No calls, conditionals, etc. allowed: This side is obviously easy to compile for on any architecture, but the limited flexibility may cramp the programming style of many programmers. While this may be for the best on architectures that rely mostly on SIMD, it could make MIMD code longer and more complex than necessary.

2) Everything allowed: This is a much harder compilation job, but it allows code to be written in a more natural style. I personally feel that this is the better option IF we can write a fairly efficient code generator to handle some of the tougher, SIMD compilation targets. I would consider this a good research topic, and it will be one of the tougher aspects of the compilation job.

In reality, we will probably start with a constrained model, like #1, and then slowly expand towards #2 as much as possible.

> General scheduling of kernels

From the standpoint of DSP-C, I'm mostly of the viewpoint that one should just take the equivalent ANSI C code and start a kernel whenever the loop whose function it encapsulates would be executed during the normal course of the overall program. Of course, this is a fairly simple and low-level scheduling technique that assumes that the "serial" portions of the program between kernels are executed on a central controlling processor. I'm open to higher-level kernel scheduling constructs in our "general" language, but I would also be quite happy to see them relegated to higher-level, more application-specific languages where the design of these higher-level ways to schedule kernels may be clearer.

> What Stream I/O operations should be allowed?

I personally feel that you need to *allow* pretty much any sort of stream access pattern in the language. There will always be some funny cases when someone will absolutely have to use a really obnoxious stream indexing function. You need to allow that usage to be expressed in some way, even if you do it with the caveat that it will only run REALLY SLOWLY -- probably without taking advantage of the architecture's streaming features at all. The real question is not what is *allowed,* but what different "performance enhancing access modes" we support within high-speed kernels. In other words, we need to make sure that the compiler can recognize (either automatically or by programmer-supplied hints) when a simple access pattern such as a strided or nearest-neighbor one is being used, and then generate optimized stream code that takes advantage of the additional speed that the architecture can provide when accessing those data patterns. I feel that even without special syntax a compiler should be able to spot at least the simple patterns and extract them, but we may need to provide some additional operators to tell the compiler when more complex access patterns are occurring that may still be optimized to some extent using stream accesses.
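As a purely illustrative example in ordinary C (the names here are made up): the index in the first loop below is a simple affine function of the loop counter, so a compiler ought to be able to recognize it and emit a strided stream transfer, while the table-driven indexing in the second loop should still be legal but can only fall back to slow, element-at-a-time gathers:

    void access_patterns(const float *interleaved, float *left,
                         const float *samples, float *shuffled,
                         const int *perm, int n)
    {
        int i;

        /* Strided access (every other element): easy to spot and map   */
        /* onto a single strided stream load on hardware that has one.  */
        for (i = 0; i < n; i++)
            left[i] = interleaved[2 * i];

        /* Arbitrary permutation through a lookup table: still allowed,  */
        /* but the best the compiler can do is individual gathers, so    */
        /* it should be expected to run much more slowly.                */
        for (i = 0; i < n; i++)
            shuffled[i] = samples[perm[i]];
    }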
> Semantics:
> There are still all sorts of semantic questions.
> Do we want to allow conditional reads or use the function metaphor?

I don't have too many opinions about fairly high-level semantic issues such as this, since most of my DSP-C work has been at the low, "high-level assembly" level. From my point of view, these may be issues that I would rather punt to a higher-level, more application-specific language. While we're on the topic, however, we want to provide hooks or support for some higher-level yet non-standard forms of programming, such as those described in StreaMIT. For example, I've thought of adding a special "time" dimension to the CAAs in DSP-C to allow the compiler to help in buffering and blocking of dataflow over time in pipelined algorithms.

A question that we should ask is what "level" various language features fall into, so that we can figure out which ones merit inclusion in any final language that we create . . . and which ones should be relegated to related but separate application-specific or hardware-specific domains instead. If we try to include too many things in a single "level" of the language, we're apt to get a messy, C++-esque beast of a final language definition.

> Applications:
> What applications are critical for the streaming language?

I think that this has been covered pretty well already. We need to analyze the usual suspects among signal processing, multimedia, and graphics applications, of course. We can also throw in a few scientific applications, too, if desired. I'm not picky, and I'll be happy to code some of them up.

Whew! I think that pretty much covers where I'm coming from here.

Lance