Sent: Thursday, October 04, 2001 3:32 PM
To: lance@leland.Stanford.EDU; Ujval Kapasi; Billmark@Lambert.Stanford.Edu
Subject: Streaming Languages

Here are some of my thoughts on streaming.

----------------------------------------------------------------
Hardware Model:

What are the underlying assumptions about the hardware model? First, the main reasons for doing "streaming" computation:

- Data parallelism. Each element in the input stream is computed separately, without dependencies on other elements in the stream.
- Task parallelism. Each kernel can start executing as soon as its inputs are ready.

So the question becomes how to express this parallelism in a program. Explicit parallel programming, like thread libraries or hardware-dependent languages, can be quite OS and hardware specific. Rather, the language should, as a goal, abstract the hardware so that the same program runs with excellent efficiency on Intel, SmartMemories, SSC, etc.

The kernel/stream model used by Imagine nicely makes communication and computation explicit. The software designer has to think not only about the calculations each kernel is doing, but also about the data flow, which may dictate performance.

In terms of a hardware model for the Streaming Language, I would adapt the stream/kernel model of Imagine. While exactly what constitutes a stream, or what a kernel is capable of computing, may be totally different from Imagine, dividing computation and communication is key for obtaining parallelism. As stated above, no assumptions should be made about the number of functional units, the memory hierarchy, or other hw-specific details. At the same time, the application designer should assume that the compiler/system will schedule their kernels in the most efficient manner for the hardware. The design of the language SHOULD make this doable for the compiler.

---------------------------------------------------------------------------
Compiler:

My view of the compiler is that of a beefy pattern matcher with a large system library which can be inlined. It can recognize code snippets that map to a particular system library call and can properly munge the code into calling into the system library. Therefore, for any algorithm that can be coded in C and is easily recognizable, the application writer should assume the compiler is performing the corresponding optimizations.

Things like kernel collapsing, blocking computation, and general scheduling of kernels are things the compiler should be able to do. One thing to note is that general scheduling of kernels depends not only on the communication between kernels (one kernel can't start before the previous one has produced its outputs), but also on the size of each kernel. Large kernels take longer, so you want to allocate more resources to them so that they can keep up with the smaller, faster kernels.

- Stream memory management. Clearly, different machines may have very different stream management schemes. The system library should be able to do this for the appropriate hardware.

- Recognize temporary intermediate streams. This is a bit harder. It is quite important, since we want to be saving bandwidth between chips. Dawson has already shown that his compiler can figure out the usage of variables, and in many ways this is like garbage collection in Java. So currently I'm putting this in the doable category, but it might be kinda hard. If we decide that we don't want to rely on the compiler for this one, we can add a "temporary" or "local" keyword. (There's a short C sketch of such a temporary right below.)
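To make the model concrete, here's roughly what two chained kernels and a temporary intermediate stream look like if you write them out in plain C today. The kernel names and the scale/bias computation are just made up for illustration; this is a sketch of the model, not proposed syntax:

    #include <stdlib.h>

    /* Elementwise kernel: each output element depends only on the
       corresponding input element, so all iterations are independent.
       This is the data parallelism. */
    void scale_kernel(const float *in, float *out, float s, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * s;
    }

    void bias_kernel(const float *in, float *out, float b, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] + b;
    }

    void pipeline(const float *src, float *dst, size_t n) {
        /* tmp is a temporary intermediate stream: produced by one kernel,
           consumed by the next, never visible afterwards. This is exactly
           the usage pattern the compiler should recognize. */
        float *tmp = malloc(n * sizeof *tmp);
        scale_kernel(src, tmp, 2.0f, n);  /* kernel 1 */
        bias_kernel(tmp, dst, 1.0f, n);   /* kernel 2: can start consuming
                                             as soon as kernel 1 produces
                                             outputs (task parallelism) */
        free(tmp);
    }

On Imagine-like hardware, tmp would of course live in on-chip stream storage rather than going through malloc; the point is that its lifetime is recoverable from the code, so the compiler can keep it off the external memory path.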
-----------------------------------------------
What Stream I/O operations should be allowed?

In this area, I would argue that it's better to err on the side of being more complete than being restrictive. Someone is always going to want to do something you don't support. This means I would allow full indexing, along with relative addressing, and list or set-like computation.

However, these different access patterns come with different costs. For example, full indexing requires that the array be fully defined before any kernels perform lookups into it. That kills parallelism. Therefore, I would emphasize that the programmer declare which type of access will be performed on the different inputs. This way, there is an understanding that they are requesting an access pattern that might be expensive. (There's a short C sketch of the cost difference below, after the Semantics section.)

Clearly there will be a need for relative and full indexing. The things I'm not sure about are:

- Strides? Is this needed for matrix solves?
- Generalized walk directions? Do scientific applications have regular access patterns?

----------------------------------------------------------------------
Semantics:

Here I would vote for C-style code. People are familiar with it. It will make porting code easy. We can leverage existing compilers via metacompilation.

I'm also a fan of the function metaphor instead of conditional reads. With this model, a function is executed when all of its inputs are ready to go. This is simple for the programmer to understand and fits nicely with the data parallelism model: function F() is applied to each element in stream S. The question remains, do we need conditional inputs? John Owens mentioned a merge sort algorithm that would benefit from it. We should look at existing Imagine apps that use it and see if they could be ported.

As for what a kernel can do, I would say that it should be able to do any mathematical operation written out in a C style. Different hardware is going to have different computational units, so there is no point in limiting ourselves here. Likewise, I would allow conditionals and looping. Since we are targeting scientific computing, I would assume a vector library which should allow operations like:

    vec3 a (1, 0, 0);
    mat3 b (1, 0, 0, 0, 1, 0, 0, 0, 1);  // identity
    vec3 c = b * a;

In defining streams, keeping close to the C model can help make stream usage understandable to new users. Streams can be declared like structs:

    stream Color {
        float r;
        float g;
        float b;
    }

    Color fragments;                // A stream set or list
    Color framebuffer[1024][1024];  // An array

Kernels then could be declared as functions which take streams as inputs:

    stream Vertex {
        vec4 v;
    }

    void vtransform (Vertex vtx, Vertex tvtx, const mat4 matrix) {
        tvtx.v = matrix * vtx.v;
    }

There are a lot more details to be discussed here, but I would like to emphasize the C model as a possible clean solution.
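Tying the C model back to the access patterns above, here's a plain-C sketch of what relative addressing and full indexing boil down to, and why the latter is the expensive one. The kernel names are made up for illustration:

    #include <stddef.h>

    /* Relative addressing: each output reads a fixed window around its
       input element, so every element can still be computed independently
       and the kernel can start as soon as its window is available. */
    void smooth_kernel(const float *in, float *out, size_t n) {
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }

    /* Full indexing: table must be completely defined before any lookup,
       so this kernel cannot overlap with whatever kernel produces table. */
    void gather_kernel(const int *idx, const float *table,
                       float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }

The smoothing kernel pipelines; the gather serializes against its producer. That difference is why I want the programmer to declare the access pattern up front.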
------------------------------------------------------------
Applications:

What applications are critical for the streaming language? Basic programs are interesting, but we really need to test the robustness of the language with hard-core scientific calculations. Stuff like Ron's water simulations, or the Mach calculations over an aircraft wing. We really need these to go fast in order for this to be useful.

-------------------------------------------------------------
Other issues?

Here are some other issues that I think we need to address eventually:

- Stream casting? Can we cast a stream of vertices to a stream of triangles? Can I cast an array to a regular list and back?
- Counter streams? Streams which are defined as 1, 2, 3, ...
- Graph operations? Some scientific computing is done over a mesh. How do we represent this? (One possible representation is sketched at the end of this mail.)
- Ordering. Where is this specified? Should we make basic assumptions about ordering and offer keywords to do otherwise?
- How do we handle overflow? There may be limits on the kernel size. Do we punt or break up the computation? Furthermore, we have a limit on the memories. Do we punt or overflow?
- Do we allow higher-order primitives? Arrays of streams? Streams of streams? You could implement graphs and other things with these constructs, but what does this do to the hardware model?
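On the graph/mesh question, one possible representation, just a sketch in plain C with made-up types: keep the topology in an index stream and gather into the vertex stream, accepting the full-indexing cost discussed earlier.

    #include <stddef.h>

    typedef struct { float x, y, z; } Vertex;
    typedef struct { int v[3]; } Triangle;  /* indices into vertex stream */

    /* Per-triangle kernel: gathers its three vertices (full indexing)
       and emits their centroid. The connectivity lives entirely in the
       index stream, so the vertex data itself stays a flat stream. */
    void tri_centroid(const Triangle *tris, const Vertex *verts,
                      Vertex *out, size_t ntris) {
        for (size_t i = 0; i < ntris; i++) {
            Vertex a = verts[tris[i].v[0]];
            Vertex b = verts[tris[i].v[1]];
            Vertex c = verts[tris[i].v[2]];
            out[i].x = (a.x + b.x + c.x) / 3.0f;
            out[i].y = (a.y + b.y + c.y) / 3.0f;
            out[i].z = (a.z + b.z + c.z) / 3.0f;
        }
    }

Whether this beats streams-of-streams as a mesh representation is exactly the kind of question we need to settle against the hardware model.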