From: Lance Hammond [lance@leland.stanford.edu]
Sent: Thursday, October 04, 2001 1:31 AM
To: Ian Buck; Ujval Kapasi; Bill Mark
Subject: Streaming from the Viewpoint of DSP-C

Guys:

As requested, here are my opinions about some of the features that a streaming language needs to have and some of the problems that will have to be overcome in the development of such a language. If you've already read my writings on DSP-C, a lot of the following won't be particularly new and earthshaking. If not, then you might want to look them up and read/skim them (they are now available at http://www-lance.stanford.edu/Public/ ).

I basically believe that we should make a language that allows pretty low-level access to individual operations performed within the processors, much like C does today (a sort of "high-level assembly" that is *very* portable from architecture to architecture). On the other hand, in order to take advantage of streaming-style memory systems, the language needs to be *somewhat* more explicit about the memory model, an area which is completely abstracted away in most varieties of C, apart from the "register" specifier for variables. Compilation to modern streaming-style memory hierarchies is just too complex for such a simple model. A fair amount of abstraction is still required, however, in order to ease programming and prevent us from becoming too hardware-specific. I am a little religious on this point because I feel that any language we design should be fairly architecture neutral, except for some occasional "pragma"-type option flags that may be scattered throughout the code. Otherwise, we risk becoming just a programming language-of-the-month attached to an architecture project or two, and that's not something I'm interested in.

Of course, the more abstract and high-level the language is, the more the compiler must be able to analyze in order to generate efficient code. I'm personally no big fan of compilers, but I do appreciate the fact that they allow me to run the same code on an x86, a MIPS machine, and sometimes even my Macintosh. As a result, my main goal is to create a language that has the *minimum* amount of extra stuff necessary, beyond a baseline of ANSI C, to allow a relatively simple compiler to generate fairly good code on a streaming-type machine (or a DSP processor, a normal processor with multimedia extensions, etc.).

In the remainder of this email, I've tried to answer the questions Ian mentioned, although I've reordered them somewhat.

> Hardware Model:
> What are the underlying assumptions about the hardware model?

In the design of DSP-C, I settled on an abstract model consisting of one or more processors (which I have not defined in detail -- I regard most internal processor differences as something that should NOT be reflected in the language itself) connected to a three-level abstract memory system, which I view as the crucial portion of the various machines that must be abstracted away:

Processor registers: This very small chunk of "memory" attached to each processor is mostly used to hold temporary values for a few cycles, facilitate software pipelining, etc. Small variables or parts of variables (such as a few values stripmined out of an array) may also be register allocated, when possible. Except for the array stripmining, which could have a significant effect on an architecture with a lot of registers like Imagine, this level is treated by a compiler just like the registers in conventional processors today.
Local compiler-controlled memory: These on-chip memories, each associated with one or more processors, are an addition to the "normal" compiler view of the world. From a compiler perspective, these small memories (memory mats in Smart Memories, the SRF in Imagine, etc.) are treated as big register files that can contain arrays -- or very large stripmined-out sections of arrays -- instead of individual values. On most machines these local memories may be randomly accessed, but in the case of Imagine they may not. In effect, they are a software-controlled cache (for arrays only, in the Imagine case). The compiler is then responsible for doing register-file-style allocation within this local memory (as the Imagine project has already done) and for scheduling "loads" and "stores" to and from this memory (which the Imagine folks have done by hand up to this point).

Main memory: Large, off-chip memories are abstracted together into a single layer that just looks like an "infinite" sized memory holding all of the data needed by the program. This is just like the conventional compiler view of main memory, except that it is accessed primarily by special array/vector/stream accesses that will probably be handled with DMA-style hardware. For very large systems, this layer could be further split up into "local main memory" and "remote main memory," with the compiler trying to allocate data to minimize remote memory use as much as possible, using techniques like those used in compilers from the 1990s such as SUIF.

Interprocessor communication can be abstracted (from the compiler's point of view, at least) as reads and writes to "shared memory." In different systems, this "shared memory" could fall into any of the three different layers, and will require varying amounts of code generation. Explicit inter-cluster communication in Imagine looks essentially like a set of "shared registers," while processors that share a local and/or main memory bank may obviously pass data from one to another via loads and stores to and from the actual shared memory.

> MIMD/SIMD may not be that important but what should the programmer think
> about when writing optimal code with the language?

For extraction of basic data-parallel/streamable code segments, we generally should not care whether the individual processing units are MIMD, SIMD, etc. The only difference is in the type and amount of synchronization code that must be inserted by the compiler to keep everything working smoothly. As far as the programmer is concerned, they just need to write code with lots of very data-parallel loops -- and be able to make this parallelism clear. The compiler should then be able to analyze their code in a simple manner, detect a lack of dependencies, and split the loops into parallel segments. We just need to make sure that the language has enough hooks to indicate to a fairly simple compiler where the parallel loop bodies/kernels are located. While a complicated compiler could probably perform most or all of the necessary analysis itself, I think that some parallelism hints buried in the language will probably be necessary in the end.

On the other hand, if we take advantage of parallelism by allocating different tasks to the different processors, such as different "pipeline stages" of an algorithm, then we must clearly have MIMD machines. We may also need more explicit programming-language specification for these portions of code. More on that below, however.
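To make the data-parallel loop style (and the three-level memory view above) concrete, here's a rough sketch in plain ANSI C of the kind of code I have in mind. The BLOCK size, the local_in/local_out buffers, and the hand-written transfers are just illustrative stand-ins for whatever a real compiler would choose and generate for a particular local memory -- none of this is meant as actual DSP-C syntax:

    #define N     65536
    #define BLOCK 1024               /* hypothetical tile size, chosen per machine */

    float in[N], out[N];             /* live in "main memory"                      */

    void scale_and_bias(float gain, float bias)
    {
        float local_in[BLOCK];       /* the compiler would place these two buffers */
        float local_out[BLOCK];      /* in local memory (SRF, memory mat, etc.)    */
        int i, j;

        for (i = 0; i < N; i += BLOCK) {
            /* "load": a bulk, DMA-style transfer from main memory */
            for (j = 0; j < BLOCK; j++)
                local_in[j] = in[i + j];

            /* the kernel: no cross-iteration dependencies, so it can be    */
            /* split across SIMD clusters or MIMD processors as the target  */
            /* machine prefers, with individual values register allocated   */
            for (j = 0; j < BLOCK; j++)
                local_out[j] = gain * local_in[j] + bias;

            /* "store": a bulk transfer back out to main memory */
            for (j = 0; j < BLOCK; j++)
                out[i + j] = local_out[j];
        }
    }

Ideally the programmer writes only the kernel loop over the whole arrays and the compiler synthesizes the stripmining loop and the transfers itself; the point is simply that nothing here requires more than dependence and liveness analysis over very ordinary C.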
> Compiler:
> What should the compiler be responsible for?

I personally feel that it should be responsible for as much as it possibly can be, in order to allow code written in our language to be as portable as possible. It's our goal to figure out what it is reasonable for the compiler to deduce on its own, and to put in just enough language features to reach that point. Put another way, the real question is: what must the programmer explicitly encode in an application in order to make it easy to compile?

> Stream memory management

I feel that most memory handling among the abstract layers should be fairly automatic. Compilers today can handle register allocation, and the management of the multilayered memory in a streaming machine (as described above) is largely just a larger-scale exercise in register allocation, since the compiler is now also responsible for allocation of an additional layer of local memory. Because local, on-chip memories come in all shapes and sizes, leaving this task to the compiler is probably the only way to get reasonably optimal object code on multiple machines from the same source code, since the optimal degree of data blocking and stripmining will no doubt vary with memory size and architecture. These techniques have been used by compilers for years to divide up code into vectors and chunks for parallel processors, so I don't think compilers will have many problems, as long as we do simple things like eliminating C pointer aliasing effects. Of course, putting in a few helpful hints (similar to the "register" notation in ANSI C) can still be allowed in order to give the programmer more control over this aspect of compilation.

> Recognize temporary intermediate streams

This should not be too difficult for a compiler, since it is mostly a case of variable liveness analysis writ large. If a stream -- or a portion of a stream -- isn't referred to in the "future," before it is deallocated or goes out of scope, then it's clearly temporary and does not need to be "stored" to any levels of memory beyond those necessary to hold it during the time it is "live."

> Kernel collapsing (by this, I'm assuming Ian means combining small kernels)
> Blocking of computation (and this is "splitting" big kernels)

A compiler should be able to handle this, by and large. Most decisions about how to divide up kernels will simply be made by adding up how much memory is needed, and then merging or splitting the "kernels" (inner loop bodies in source code) to fit the available registers and/or local memory. We will have to do some compiler work to figure out exactly what the limits of compiler analysis are, but since the problem is very similar to blocking and strip mining on vector machines, I don't think that it will be a serious impediment. The only "funny" part that I can see here is that several different memory use patterns may appear fairly "optimal," especially if a compiler is designed to try tricks like loop fusion and/or loop nest reordering to generate different memory access patterns. However, a compiler could simply generate all of the plausible cases, execute them, and use profiling to pick the best one.
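As a tiny illustration of the merging case (ordinary C again, with made-up names): suppose the programmer writes two small kernels, one producing a stream tmp and one consuming it. If liveness analysis shows that tmp is used nowhere else, the two loop bodies can be fused, and tmp collapses from an entire intermediate stream down to a single register value:

    /* As written: two separate kernels with a temporary stream between them.  */
    /*     for (i = 0; i < n; i++) tmp[i] = a[i] * b[i];   -- kernel 1         */
    /*     for (i = 0; i < n; i++) c[i] = tmp[i] + d[i];   -- kernel 2         */
    /* After collapsing the two kernels into one:                              */

    void fused(const float *a, const float *b, const float *d,
               float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            float tmp = a[i] * b[i];   /* formerly a whole stream in memory */
            c[i] = tmp + d[i];         /* now just a register-held scalar   */
        }
    }

Splitting an oversized kernel is the same bookkeeping run in reverse: add up the registers and local memory each candidate loop body needs, and stop merging (or start splitting) when the total no longer fits.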
> What can a kernel routine do? Function calls, Conditionals, etc.

This is a tough question. While all of the memory allocation issues are independent of the type of parallel processor architecture used, these aspects are not. Let's just look at the extremes:

1) No calls, conditionals, etc. allowed: This side is obviously easy to compile for on any architecture, but the limited flexibility may cramp the programming style of many programmers. While this may be for the best on architectures that rely mostly on SIMD, it could make MIMD code longer and more complex than necessary.

2) Everything allowed: This is a much harder compilation job, but it allows code to be written in a more natural style. I personally feel that this is the better option IF we can write a fairly efficient code generator to handle some of the tougher, SIMD compilation targets. I would consider this a good research topic, and it will be one of the tougher aspects of the compilation job.

In reality, we will probably start with a constrained model, like #1, and then slowly expand towards #2 as much as possible.

> General scheduling of kernels

From the standpoint of DSP-C, I'm mostly of the viewpoint that one should just take the equivalent ANSI C code and start a kernel whenever the loop whose function it encapsulates would be executed during the normal course of the overall program. Of course, this is a fairly simple and low-level scheduling technique that assumes that the "serial" portions of the program between kernels are executed on a central controlling processor. I'm open to higher-level kernel scheduling constructs in our "general" language, but I would also be quite happy to see them relegated to higher-level, more application-specific languages where the design of these higher-level ways to schedule kernels may be clearer.

> What Stream I/O operations should be allowed?

I personally feel that you need to *allow* pretty much any sort of stream access pattern in the language. There will always be some funny cases when someone will absolutely have to use a really obnoxious stream indexing function. You need to allow that usage to be expressed in some way, even if you do it with the caveat that it will only run REALLY SLOWLY -- probably without taking advantage of the architecture's streaming features at all. The real question is not what is *allowed,* but what different "performance enhancing access modes" we support within high-speed kernels. In other words, we need to make sure that the compiler can recognize (either automatically or by programmer-supplied hints) when a simple access pattern such as a strided or nearest-neighbor one is being used, and then generate optimized stream code that takes advantage of the additional speed that the architecture can provide when accessing those data patterns. I feel that even without special syntax a compiler should be able to spot at least the simple patterns and extract them, but we may need to provide some additional operators to tell the compiler when more complex access patterns are occurring that may still be optimized to some extent using stream accesses.
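As a purely illustrative example in ordinary C (the names here are made up): the index in the first loop below is a simple affine function of the loop counter, so a compiler ought to be able to recognize it and emit a strided stream transfer, while the table-driven indexing in the second loop should still be legal but can only fall back to slow, element-at-a-time gathers:

    void access_patterns(const float *interleaved, float *left,
                         const float *samples, float *shuffled,
                         const int *perm, int n)
    {
        int i;

        /* Strided access (every other element): easy to spot and map   */
        /* onto a single strided stream load on hardware that has one.  */
        for (i = 0; i < n; i++)
            left[i] = interleaved[2 * i];

        /* Arbitrary permutation through a lookup table: still allowed,  */
        /* but the best the compiler can do is individual gathers, so    */
        /* it should be expected to run much more slowly.                */
        for (i = 0; i < n; i++)
            shuffled[i] = samples[perm[i]];
    }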
> Semantics:
> There are still all sorts of semantic questions.
> Do we want to allow conditional reads or use the function metaphor?

I don't have too many opinions about fairly high-level semantic issues such as this, since most of my DSP-C work has been at the low, "high-level assembly" level. From my point of view, these may be issues that I would rather punt to a higher-level, more application-specific language. While we're on the topic, however, we want to provide hooks or support for some higher-level yet non-standard forms of programming, such as those described in StreaMIT. For example, I've thought of adding a special "time" dimension to the CAAs in DSP-C to allow the compiler to help in buffering and blocking of dataflow over time in pipelined algorithms.

A question that we should ask is what "level" various language features fall into, so that we can figure out which ones merit inclusion in any final language that we create . . . and which ones should be relegated to related but separate application-specific or hardware-specific domains instead. If we try to include too many things in a single "level" of the language, we're apt to get a messy, C++-esque beast of a final language definition.

> Applications:
> What applications are critical for the streaming language?

I think that this has been covered pretty well already. We need to analyze the usual suspects among signal processing, multimedia, and graphics applications, of course. We can also throw in a few scientific applications, too, if desired. I'm not picky, and I'll be happy to code some of them up.

Whew! I think that pretty much covers where I'm coming from here.

Lance