From: Ujval Kapasi [ujk@Stanford.EDU]
Sent: Thursday, October 04, 2001 3:25 PM
To: Ian Buck; Bill Mark; Lance Hammond
Cc: Mark Horowitz
Subject: stream thoughts

Here are my thoughts. This is more of a brain dump than a document/memo.

-- ujval

> What Stream I/O operations should be allowed?
> - Relative Addressing (arrays)?
> - Full Indexing?
> - Strides?
> How useful are these for stream computing?
> How bad/hard are these for hardware?

I think the current methodology for this stuff that we use in StreamC is pretty good. Basically, you can declare a basic stream and then derive one or more streams from it with any type of indexing you want (i.e., fully indexed, strided, etc.).

I also think all memory references within a kernel should be explicitly declared as derived, fully indexed streams. I feel it would be too much effort to get a compiler to extract the memory references from a kernel automatically. Even on an underlying architecture where arbitrary memory references can be made within a kernel, it's probably useful to separate out the fully indexed streams because it will help with things like prefetching.

One very important aspect of streams that we should think about is their length. I know that the StreaMIT people have run into some trouble by not making length a basic property of a stream. I think the programmer should have to specify the lengths of all streams that are indeed fixed-length. An acceptable length could be 'infinity', in order to handle I/O from some network peripheral in a clean way. Another accepted length could be 'variable', with a min and an optional max specifier. (The first sketch below shows how declared lengths and derived streams might fit together.)

> Compiler:
> What should the compiler be responsible for?
> Should the compiler do things like:
> - Stream memory management
> - Kernel collapsing
> - Recognize temporary intermediate streams
> - Blocking of computation
> - General scheduling of kernels
> Where does each of these fall on the easy-hard scale?
> If the compiler doesn't do it, how is it expressed?

I think the smaller the granularity at which you specify a kernel, the better. This will allow the compiler the most flexibility when dealing with kernels. I think collapsing smaller kernels into bigger ones is a lot easier for the tool than trying to dissect larger kernels into constituent pieces. If the tool can handle collapsing kernels reasonably well, the language can force the programmer to split a kernel in order to expose streams that would otherwise have been temporary data within a larger kernel. This is important for dealing with fully indexed memory streams (i.e., stream accesses that actually require going to the memory controller), and for intercluster communication of data that is produced within a kernel. Defining and using smaller kernels might be slightly more tedious for the programmer, but perhaps allowing nesting of kernels will alleviate much of this tedium.

Taking this idea a step further, we may want hierarchical streams also. (I haven't thought about this one too much yet... I think I need to try to apply this idea to some apps before I can make a judgement on whether this is something we should even consider supporting in the language and compiler.) The basic idea is that you can define a stream composed of elements of a complex and/or large data type, such as an image. Then you apply some kernel(s) to this stream of images. Within these highest-level kernels, you might derive simpler streams, such as a stream of rows, and then send these streams through some second-level kernels. For example, you might nest streams & kernels in this way until you get to the granularity of individual color components of a pixel, if you are doing some very low-level operation. This would be a way to expose a lot of important information about the locality & required communication in an app to the compiler; however, there's a strong possibility that it's not possible to represent even a moderately complicated application in this format in a clean fashion. (The second sketch below plays with kernel splitting and hierarchical streams.)
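
To make the stream I/O answer concrete, here is a rough sketch, in plain C++, of what declared lengths and derived streams might look like. None of this is actual StreamC syntax; the Stream type and the stride/gather names are stand-ins I made up for illustration.

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch, not real StreamC: a stream whose length is
    // declared up front, plus derived strided and fully indexed views.
    template <typename T>
    struct Stream {
        std::vector<T> data;  // backing storage
        explicit Stream(std::size_t len) : data(len) {}  // length is basic

        // Derived stream: every 'step'-th element starting at 'start'.
        Stream<T> stride(std::size_t start, std::size_t step) const {
            Stream<T> s(0);
            for (std::size_t i = start; i < data.size(); i += step)
                s.data.push_back(data[i]);
            return s;
        }

        // Derived stream: fully indexed (gather). Declaring these
        // explicitly exposes the memory references for prefetching.
        Stream<T> gather(const std::vector<std::size_t>& idx) const {
            Stream<T> s(0);
            for (std::size_t i : idx) s.data.push_back(data[i]);
            return s;
        }
    };

Usage would be something like Stream<float> base(1024); Stream<float> odds = base.stride(1, 2); and a 'variable'-length stream would presumably carry its min and optional max bounds alongside the storage in the same way.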
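
Here is a second sketch, for the kernel-splitting and hierarchical-stream ideas. Again, every name is made up: an image is treated as a stream of rows, two small second-level kernels replace one big one, and the data between them shows up as an explicit intermediate stream rather than a temporary buried inside a kernel.

    #include <cstddef>
    #include <vector>

    using Row   = std::vector<float>;
    using Image = std::vector<Row>;  // hierarchical element: a stream of rows

    // Second-level kernel: 1-D box filter over the pixels of one row.
    Row smooth_row(const Row& r) {
        Row out(r.size());
        for (std::size_t i = 0; i < r.size(); ++i) {
            float left  = (i > 0)            ? r[i - 1] : r[i];
            float right = (i + 1 < r.size()) ? r[i + 1] : r[i];
            out[i] = (left + r[i] + right) / 3.0f;
        }
        return out;
    }

    // Second-level kernel: binarize one row against a threshold.
    Row threshold_row(const Row& r, float t) {
        Row out(r.size());
        for (std::size_t i = 0; i < r.size(); ++i)
            out[i] = (r[i] > t) ? 1.0f : 0.0f;
        return out;
    }

    // Top-level kernel over the hierarchical element. 'smoothed' is the
    // explicit intermediate stream between the two row kernels.
    Image clean_image(const Image& img) {
        Image smoothed, out;
        for (const Row& r : img)      smoothed.push_back(smooth_row(r));
        for (const Row& r : smoothed) out.push_back(threshold_row(r, 0.5f));
        return out;
    }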

The compiler should automatically handle software pipelining of applications. Imagine's kernel scheduler already deals with this very well for kernels, but we need to software pipeline stream operations also. How much of this will be run-time scheduling in hw/sw and how much will be static is up in the air. Peter has done a lot of work on this in his StreamC tools, and ideally we should be able to leverage his algorithms.

Stripmining is a little more complicated. I would love it if someone came up with a reliable and efficient way to do this automatically. It is not so bad if your underlying architecture is the usual general-purpose microprocessor, but it can get kind of hairy for certain kernels on an Imagine-esque architecture. Perhaps having an efficient dumping mechanism would make this doable by an automated tool. (All this applies to 'blocking' as well.)
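
For reference, stripmining by hand looks something like the sketch below (plain C++; the strip length and all the names are made up). The point is just that a long stream is processed in strips whose working set fits on chip; an automated tool would have to pick the strip size and then software pipeline the load/compute/store phases of consecutive strips.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    const std::size_t kStripLen = 1024;  // assume one strip fits in the SRF

    float kernel(float x) { return 2.0f * x + 1.0f; }  // toy kernel

    // Process a long stream strip by strip. A real tool would overlap the
    // load/compute/store of consecutive strips (software pipelining).
    void process(const std::vector<float>& in, std::vector<float>& out) {
        out.resize(in.size());
        for (std::size_t base = 0; base < in.size(); base += kStripLen) {
            std::size_t n = std::min(kStripLen, in.size() - base);
            for (std::size_t i = 0; i < n; ++i)
                out[base + i] = kernel(in[base + i]);
        }
    }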

Finally, an open problem that we are definitely going to have to solve is getting the compiler tool to efficiently map stream programs to multiple-node configurations. We don't want to have to use the typical SPMD programming methodology for these supercomputer apps. For example, the tool should be able to automatically pipeline consecutive kernels across different nodes. Again, how much of this is done at run-time vs. compile-time is up in the air. Also up in the air is how much human hand-holding is required/reasonable for this problem. In any case, I am for supporting as much of this as possible in the compiler, and hence exposing as little of this aspect to the programmer as possible.

> Semantics:
> There are still all sorts of semantic questions.
> Do we want to allow conditional reads, or use the function metaphor?
> What can a kernel routine do? Function calls, Conditionals, etc.

Bill (D) commented that the function metaphor might present trouble for memory management. There needs to be enough explicit sequencing of the program to avoid having to allocate multiple copies of the same data structure. For example, an Imagine kernel has a main loop which sequences all of the iterations. However, the more sequencing we add, the more parallelism we are potentially sacrificing.

I don't think conditional streams are something we necessarily need to expose in this language. Depending on the hardware implementation, however, conditional streams could be an intermediate target for the compiler tools.

Function calls are a little trickier. I'm pretty sure we don't want to support multiple program counters and have a full-fledged MIMD set of clusters. The SIMD model maps to efficient hardware, and it is not far from the model required by most media apps we've looked at so far. My hunch is that scientific apps fall nicely into this category as well.

> Hardware Model:
> What are the underlying assumptions about the hardware model?
> MIMD/SIMD may not be that important, but what should the programmer think
> about when writing optimal code with the language?

I've been assuming a hardware model largely based on the current Imagine architecture. While the current Imagine architecture is pretty raw, it does a good job of only including absolutely necessary hardware functions & communication paths. I think it's possible in our language to abstract away most of the gory bookkeeping necessary for implementing on a SIMD machine, but we'll have to constrain the programmer to structure programs to deal with the remaining hardware 'features.' Thus they may not have to think about the actual underlying hardware, but the language spec will force them to abide by whatever restrictions are present due to the hardware. An example of this is having to declare temporary data as an explicit stream between two hierarchical kernels.

Also, not every portion of an application should be executed on Imagine-like arithmetic clusters. There are control-intensive portions of an application that should be implemented on a RISC-like processor instead; expressing these portions in a streaming language would be awkward, and is unnecessary. We should think about how we are going to interface these code segments with the main arithmetic-intensive streaming portions in our language.

-- ujval
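
P.S. To make that last point concrete, here's one more rough sketch (plain C++, every name made up) of the kind of split I mean: data-dependent control decisions stay on the RISC-like processor, and only the regular, arithmetic-heavy work is handed off as stream operations.

    #include <cstddef>
    #include <vector>

    // Toy stand-in for a kernel that would run on the arithmetic clusters.
    std::vector<float> square_kernel(const std::vector<float>& s) {
        std::vector<float> out(s.size());
        for (std::size_t i = 0; i < s.size(); ++i)
            out[i] = s[i] * s[i];
        return out;
    }

    // Control-intensive driver that would stay on the RISC-like processor.
    void host_loop(std::vector<std::vector<float> >& work) {
        for (std::size_t j = 0; j < work.size(); ++j) {
            if (work[j].size() < 64) {
                // short/irregular case: cheaper to handle it locally
                for (std::size_t i = 0; i < work[j].size(); ++i)
                    if (work[j][i] < 0.0f) work[j][i] = 0.0f;
            } else {
                work[j] = square_kernel(work[j]);  // regular case: stream it
            }
        }
    }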