From: Ujval Kapasi [ujk@Stanford.EDU]
Sent: Thursday, October 04, 2001 3:25 PM
To: Ian Buck; Bill Mark; Lance Hammond
Cc: Mark Horowitz
Subject: stream thoughts

Here are my thoughts. This is more of a brain dump than a document/memo.

-- ujval

> What Stream I/O operations should be allowed?
> - Relative Addressing (arrays)?
> - Full Indexing?
> - Strides?
> How useful are these for stream computing?
> How bad/hard are these for hardware?

I think the current methodology for this stuff that we use in StreamC is pretty good. Basically, you can declare a basic stream and then derive one or more streams from it with any type of indexing you want (i.e., fully indexed, strided, etc.).

I also think all memory references within a kernel should be explicitly declared as derived, fully indexed streams. I feel it would be too much effort to get a compiler to extract the memory references from a kernel automatically. Even on an underlying architecture where arbitrary memory references can be made within a kernel, it's probably useful to separate out the fully indexed streams because it will help with things like prefetching.

One very important aspect of streams that we should think about is their length. I know that the StreaMIT people have run into some trouble by not making length a basic property of a stream. I think the programmer should have to specify the lengths of all streams that are indeed fixed-length. An acceptable length could be 'infinity', in order to handle I/O from some network peripheral in a clean way. Another accepted length could be 'variable', with a min and an optional max specifier. (The first sketch below shows how declared lengths and derived streams might fit together.)

> Compiler:
> What should the compiler be responsible for?
> Should the compiler do things like:
> - Stream memory management
> - Kernel collapsing
> - Recognize temporary intermediate streams
> - Blocking of computation
> - General scheduling of kernels
> Where does each of these fall on the easy-hard scale?
> If the compiler doesn't do it, how is it expressed?

I think the smaller the granularity at which you specify a kernel, the better. This will allow the compiler the most flexibility when dealing with kernels. I think collapsing smaller kernels into bigger ones is a lot easier for the tool than trying to dissect larger kernels into constituent pieces. If the tool can handle collapsing kernels reasonably well, the language can force the programmer to split a kernel in order to expose streams that would otherwise have been temporary data within a larger kernel. This is important for dealing with fully indexed memory streams (i.e., stream accesses that actually require going to the memory controller), and for intercluster communication of data that is produced within a kernel. Defining and using smaller kernels might be slightly more tedious for the programmer, but perhaps allowing nesting of kernels will alleviate much of this tedium.

Taking this idea a step further, we may want hierarchical streams also. (I haven't thought about this one too much yet... I think I need to try to apply this idea to some apps before I can make a judgement on whether this is something we should even consider supporting in the language and compiler.) The basic idea is that you can define a stream composed of elements of a complex and/or large data type, such as an image. Then you apply some kernel(s) to this stream of images. Within these highest-level kernels, you might derive simpler streams, such as a stream of rows, and then send these streams through some second-level kernels. For example, you might nest streams & kernels in this way until you get to the granularity of individual color components of a pixel, if you are doing some very low-level operation. This would be a way to expose a lot of important information about the locality & required communication in an app to the compiler; however, there's a strong possibility that it's not possible to represent even a moderately complicated application in this format in a clean fashion. (The second sketch below plays with kernel splitting and hierarchical streams.)
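
To make the stream I/O answer concrete, here is a rough sketch, in plain C++, of what declared lengths and derived streams might look like. None of this is actual StreamC syntax; the Stream type and the stride/gather names are stand-ins I made up for illustration.

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch, not real StreamC: a stream whose length is
    // declared up front, plus derived strided and fully indexed views.
    template <typename T>
    struct Stream {
        std::vector<T> data;  // backing storage
        explicit Stream(std::size_t len) : data(len) {}  // length is basic

        // Derived stream: every 'step'-th element starting at 'start'.
        Stream<T> stride(std::size_t start, std::size_t step) const {
            Stream<T> s(0);
            for (std::size_t i = start; i < data.size(); i += step)
                s.data.push_back(data[i]);
            return s;
        }

        // Derived stream: fully indexed (gather). Declaring these
        // explicitly exposes the memory references for prefetching.
        Stream<T> gather(const std::vector<std::size_t>& idx) const {
            Stream<T> s(0);
            for (std::size_t i : idx) s.data.push_back(data[i]);
            return s;
        }
    };

Usage would be something like Stream<float> base(1024); Stream<float> odds = base.stride(1, 2); and a 'variable'-length stream would presumably carry its min and optional max bounds alongside the storage in the same way.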
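
Here is a second sketch, for the kernel-splitting and hierarchical-stream ideas. Again, every name is made up: an image is treated as a stream of rows, two small second-level kernels replace one big one, and the data between them shows up as an explicit intermediate stream rather than a temporary buried inside a kernel.

    #include <cstddef>
    #include <vector>

    using Row   = std::vector<float>;
    using Image = std::vector<Row>;  // hierarchical element: a stream of rows

    // Second-level kernel: 1-D box filter over the pixels of one row.
    Row smooth_row(const Row& r) {
        Row out(r.size());
        for (std::size_t i = 0; i < r.size(); ++i) {
            float left  = (i > 0)            ? r[i - 1] : r[i];
            float right = (i + 1 < r.size()) ? r[i + 1] : r[i];
            out[i] = (left + r[i] + right) / 3.0f;
        }
        return out;
    }

    // Second-level kernel: binarize one row against a threshold.
    Row threshold_row(const Row& r, float t) {
        Row out(r.size());
        for (std::size_t i = 0; i < r.size(); ++i)
            out[i] = (r[i] > t) ? 1.0f : 0.0f;
        return out;
    }

    // Top-level kernel over the hierarchical element. 'smoothed' is the
    // explicit intermediate stream between the two row kernels.
    Image clean_image(const Image& img) {
        Image smoothed, out;
        for (const Row& r : img)      smoothed.push_back(smooth_row(r));
        for (const Row& r : smoothed) out.push_back(threshold_row(r, 0.5f));
        return out;
    }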

The compiler should automatically handle software pipelining of applications. Imagine's kernel scheduler already deals with this very well for kernels, but we need to software pipeline stream operations also. How much of this will be run-time scheduling in hw/sw and how much will be static is up in the air. Peter has done a lot of work on this in his StreamC tools, and ideally we should be able to leverage his algorithms.

Stripmining is a little more complicated. I would love it if someone came up with a reliable and efficient way to do this automatically. It is not so bad if your underlying architecture is the usual general-purpose microprocessor, but it can get kind of hairy for certain kernels on an Imagine-esque architecture. Perhaps having an efficient dumping mechanism would make this doable by an automated tool. (All this applies to 'blocking' as well.)
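
For reference, stripmining by hand looks something like the sketch below (plain C++; the strip length and all the names are made up). The point is just that a long stream is processed in strips whose working set fits on chip; an automated tool would have to pick the strip size and then software pipeline the load/compute/store phases of consecutive strips.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    const std::size_t kStripLen = 1024;  // assume one strip fits in the SRF

    float kernel(float x) { return 2.0f * x + 1.0f; }  // toy kernel

    // Process a long stream strip by strip. A real tool would overlap the
    // load/compute/store of consecutive strips (software pipelining).
    void process(const std::vector<float>& in, std::vector<float>& out) {
        out.resize(in.size());
        for (std::size_t base = 0; base < in.size(); base += kStripLen) {
            std::size_t n = std::min(kStripLen, in.size() - base);
            for (std::size_t i = 0; i < n; ++i)
                out[base + i] = kernel(in[base + i]);
        }
    }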

Finally, an open problem that we are definitely going to have to solve is getting the compiler tool to efficiently map stream programs to multiple-node configurations. We don't want to have to use the typical SPMD programming methodology for these supercomputer apps. For example, the tool should be able to automatically pipeline consecutive kernels across different nodes. Again, how much of this is done at run-time vs. compile-time is up in the air. Also up in the air is how much human hand-holding is required/reasonable for this problem. In any case, I am for supporting as much of this as possible in the compiler, and hence exposing as little of this aspect to the programmer as possible.

> Semantics:
> There are still all sorts of semantic questions.
> Do we want to allow conditional reads, or use the function metaphor?
> What can a kernel routine do? Function calls, Conditionals, etc.

Bill (D) commented that the function metaphor might present trouble for memory management. There needs to be enough explicit sequencing of the program to avoid having to allocate multiple copies of the same data structure. For example, an Imagine kernel has a main loop which sequences all of the iterations. However, the more sequencing we add, the more parallelism we are potentially sacrificing.

I don't think conditional streams are something we necessarily need to expose in this language. Depending on the hardware implementation, however, conditional streams could be an intermediate target for the compiler tools.

Function calls are a little trickier. I'm pretty sure we don't want to support multiple program counters and have a full-fledged MIMD set of clusters. The SIMD model maps to efficient hardware, and it is not far from the model required by most media apps we've looked at so far. My hunch is that scientific apps fall nicely into this category as well.

> Hardware Model:
> What are the underlying assumptions about the hardware model?
> MIMD/SIMD may not be that important, but what should the programmer think
> about when writing optimal code with the language?

I've been assuming a hardware model largely based on the current Imagine architecture. While the current Imagine architecture is pretty raw, it does a good job of only including absolutely necessary hardware functions & communication paths. I think it's possible in our language to abstract away most of the gory bookkeeping necessary for implementing on a SIMD machine, but we'll have to constrain the programmer to structure programs to deal with the remaining hardware 'features.' Thus they may not have to think about the actual underlying hardware, but the language spec will force them to abide by whatever restrictions are present due to the hardware. An example of this is having to declare temporary data as an explicit stream between two hierarchical kernels.

Also, not every portion of an application should be executed on Imagine-like arithmetic clusters. There are control-intensive portions of an application that should be implemented on a RISC-like processor instead; expressing these portions in a streaming language would be awkward, and is unnecessary. We should think about how we are going to interface these code segments with the main arithmetic-intensive streaming portions in our language.

-- ujval
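
P.S. To make that last point concrete, here's one more rough sketch (plain C++, every name made up) of the kind of split I mean: data-dependent control decisions stay on the RISC-like processor, and only the regular, arithmetic-heavy work is handed off as stream operations.

    #include <cstddef>
    #include <vector>

    // Toy stand-in for a kernel that would run on the arithmetic clusters.
    std::vector<float> square_kernel(const std::vector<float>& s) {
        std::vector<float> out(s.size());
        for (std::size_t i = 0; i < s.size(); ++i)
            out[i] = s[i] * s[i];
        return out;
    }

    // Control-intensive driver that would stay on the RISC-like processor.
    void host_loop(std::vector<std::vector<float> >& work) {
        for (std::size_t j = 0; j < work.size(); ++j) {
            if (work[j].size() < 64) {
                // short/irregular case: cheaper to handle it locally
                for (std::size_t i = 0; i < work[j].size(); ++i)
                    if (work[j][i] < 0.0f) work[j][i] = 0.0f;
            } else {
                work[j] = square_kernel(work[j]);  // regular case: stream it
            }
        }
    }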