Sent: Thursday, October 04, 2001 3:32 PM
To: lance@leland.Stanford.EDU; Ujval Kapasi; Billmark@Lambert.Stanford.Edu
Subject: Streaming Languages

Here are some of my thoughts on streaming.

----------------------------------------------------------------
Hardware Model:

What are the underlying assumptions about the hardware model? First, the main reasons for doing "streaming" computation:

- Data parallelism. Each element in the input stream is computed separately, without dependencies on other elements in the stream.
- Task parallelism. Each kernel can start executing as soon as its inputs are ready.

So the question becomes how to express this parallelism in a program. Explicit parallel programming, like thread libraries or hardware-dependent languages, can be quite OS and hardware specific. Rather, the language should, as a goal, abstract the hardware so that the same program runs with excellent efficiency on Intel, SmartMemories, SSC, etc.

The kernel/stream model used by Imagine nicely makes communication and computation explicit. The software designer has to think not only about the calculations each kernel is doing, but also about the data flow, which may dictate performance.

In terms of a hardware model for the Streaming Language, I would adapt the stream/kernel model of Imagine. While exactly what constitutes a stream, or what a kernel is capable of computing, may be totally different from Imagine, dividing computation and communication is key for obtaining parallelism. As stated above, no assumptions should be made about the number of functional units, the memory hierarchy, or other hw-specific details. At the same time, the application designer should assume that the compiler/system will schedule their kernels in the most efficient manner for the hardware. The design of the language SHOULD make this doable for the compiler.

---------------------------------------------------------------------------
Compiler:

My view of the compiler is that of a beefy pattern matcher with a large system library which can be inlined. It can recognize code snippets that map to a particular system library call and can properly munge the code into calling into the system library. Therefore, for any algorithm that can be coded in C and is easily recognizable, the application writer should assume the compiler is performing the corresponding optimizations.

Things like kernel collapsing, blocking computation, and general scheduling of kernels are things the compiler should be able to do. One thing to note is that general scheduling of kernels depends not only on the communication between kernels (one kernel can't start before the previous one has produced its outputs), but also on the size of each kernel. Large kernels take longer, so you want to allocate more resources to them so that they can keep up with the smaller, faster kernels.

- Stream memory management. Clearly, different machines may have very different stream management schemes. The system library should be able to do this for the appropriate hardware.

- Recognize temporary intermediate streams. This is a bit harder. It is quite important, since we want to be saving bandwidth between chips. Dawson has already shown that his compiler can figure out the usage of variables, and in many ways this is like garbage collection in Java. So currently I'm putting this in the doable category, but it might be kinda hard. If we decide that we don't want to rely on the compiler for this one, we can add a "temporary" or "local" keyword. (There's a short C sketch of such a temporary right below.)
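To make the model concrete, here's roughly what two chained kernels and a temporary intermediate stream look like if you write them out in plain C today. The kernel names and the scale/bias computation are just made up for illustration; this is a sketch of the model, not proposed syntax:

    #include <stdlib.h>

    /* Elementwise kernel: each output element depends only on the
       corresponding input element, so all iterations are independent.
       This is the data parallelism. */
    void scale_kernel(const float *in, float *out, float s, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * s;
    }

    void bias_kernel(const float *in, float *out, float b, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] + b;
    }

    void pipeline(const float *src, float *dst, size_t n) {
        /* tmp is a temporary intermediate stream: produced by one kernel,
           consumed by the next, never visible afterwards. This is exactly
           the usage pattern the compiler should recognize. */
        float *tmp = malloc(n * sizeof *tmp);
        scale_kernel(src, tmp, 2.0f, n);  /* kernel 1 */
        bias_kernel(tmp, dst, 1.0f, n);   /* kernel 2: can start consuming
                                             as soon as kernel 1 produces
                                             outputs (task parallelism) */
        free(tmp);
    }

On Imagine-like hardware, tmp would of course live in on-chip stream storage rather than going through malloc; the point is that its lifetime is recoverable from the code, so the compiler can keep it off the external memory path.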
-----------------------------------------------
What Stream I/O operations should be allowed?

In this area, I would argue that it's better to err on the side of being more complete than being restrictive. Someone is always going to want to do something you don't support. This means I would allow full indexing, along with relative addressing, and list or set-like computation.

However, these different access patterns come with different costs. For example, full indexing requires that the array be fully defined before any kernels perform lookups into it. That kills parallelism. Therefore, I would emphasize that the programmer declare which type of access will be performed on the different inputs. This way, there is an understanding that they are requesting an access pattern that might be expensive. (There's a short C sketch of the cost difference below, after the Semantics section.)

Clearly there will be a need for relative and full indexing. The things I'm not sure about are:

- Strides? Is this needed for matrix solves?
- Generalized walk directions? Do scientific applications have regular access patterns?

----------------------------------------------------------------------
Semantics:

Here I would vote for C-style code. People are familiar with it. It will make porting code easy. We can leverage existing compilers via metacompilation.

I'm also a fan of the function metaphor instead of conditional reads. With this model, a function is executed when all of its inputs are ready to go. This is simple for the programmer to understand and fits nicely with the data parallelism model: function F() is applied to each element in stream S. The question remains, do we need conditional inputs? John Owens mentioned a merge sort algorithm that would benefit from it. We should look at existing Imagine apps that use it and see if they could be ported.

As for what a kernel can do, I would say that it should be able to do any mathematical operation written out in a C style. Different hardware is going to have different computational units, so there is no point in limiting ourselves here. Likewise, I would allow conditionals and looping. Since we are targeting scientific computing, I would assume a vector library which should allow operations like:

    vec3 a (1, 0, 0);
    mat3 b (1, 0, 0, 0, 1, 0, 0, 0, 1);  // identity
    vec3 c = b * a;

In defining streams, keeping close to the C model can help make stream usage understandable to new users. Streams can be declared like structs:

    stream Color {
        float r;
        float g;
        float b;
    }

    Color fragments;                // A stream set or list
    Color framebuffer[1024][1024];  // An array

Kernels then could be declared as functions which take streams as inputs:

    stream Vertex {
        vec4 v;
    }

    void vtransform (Vertex vtx, Vertex tvtx, const mat4 matrix) {
        tvtx.v = matrix * vtx.v;
    }

There are a lot more details to be discussed here, but I would like to emphasize the C model as a possible clean solution.
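Tying the C model back to the access patterns above, here's a plain-C sketch of what relative addressing and full indexing boil down to, and why the latter is the expensive one. The kernel names are made up for illustration:

    #include <stddef.h>

    /* Relative addressing: each output reads a fixed window around its
       input element, so every element can still be computed independently
       and the kernel can start as soon as its window is available. */
    void smooth_kernel(const float *in, float *out, size_t n) {
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }

    /* Full indexing: table must be completely defined before any lookup,
       so this kernel cannot overlap with whatever kernel produces table. */
    void gather_kernel(const int *idx, const float *table,
                       float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }

The smoothing kernel pipelines; the gather serializes against its producer. That difference is why I want the programmer to declare the access pattern up front.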
------------------------------------------------------------
Applications:

What applications are critical for the streaming language? Basic programs are interesting, but we really need to test the robustness of the language with hard-core scientific calculations. Stuff like Ron's water simulations, or the Mach calculations over an aircraft wing. We really need these to go fast in order for this to be useful.

-------------------------------------------------------------
Other issues?

Here are some other issues that I think we need to address eventually:

- Stream casting? Can we cast a stream of vertices to a stream of triangles? Can I cast an array to a regular list and back?
- Counter streams? Streams which are defined as 1, 2, 3, ...
- Graph operations? Some scientific computing is done over a mesh. How do we represent this? (One possible representation is sketched at the end of this mail.)
- Ordering. Where is this specified? Should we make basic assumptions about ordering and offer keywords to do otherwise?
- How do we handle overflow? There may be limits on the kernel size. Do we punt or break up the computation? Furthermore, we have a limit on the memories. Do we punt or overflow?
- Do we allow higher-order primitives? Arrays of streams? Streams of streams? You could implement graphs and other things with these constructs, but what does this do to the hardware model?
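On the graph/mesh question, one possible representation, just a sketch in plain C with made-up types: keep the topology in an index stream and gather into the vertex stream, accepting the full-indexing cost discussed earlier.

    #include <stddef.h>

    typedef struct { float x, y, z; } Vertex;
    typedef struct { int v[3]; } Triangle;  /* indices into vertex stream */

    /* Per-triangle kernel: gathers its three vertices (full indexing)
       and emits their centroid. The connectivity lives entirely in the
       index stream, so the vertex data itself stays a flat stream. */
    void tri_centroid(const Triangle *tris, const Vertex *verts,
                      Vertex *out, size_t ntris) {
        for (size_t i = 0; i < ntris; i++) {
            Vertex a = verts[tris[i].v[0]];
            Vertex b = verts[tris[i].v[1]];
            Vertex c = verts[tris[i].v[2]];
            out[i].x = (a.x + b.x + c.x) / 3.0f;
            out[i].y = (a.y + b.y + c.y) / 3.0f;
            out[i].z = (a.z + b.z + c.z) / 3.0f;
        }
    }

Whether this beats streams-of-streams as a mesh representation is exactly the kind of question we need to settle against the hardware model.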