The Imagine Architecture

Here's a top-level view of the Imagine architecture:

In the problem we're looking at, we begin with a stream of prepped triangles in the stream register file (SRF). The triangles are streamed through the ALU clusters, generating a stream of output pixels, which are written back to the SRF as they are produced. So the only parts that matter for this problem are the stream register file, the microcontroller, and, in particular, the ALU clusters.

Note that the clusters are all controlled by the same microcontroller in SIMD fashion. There are 8 clusters, so 8 triangles are being processed at any one time.
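
The lockstep behavior can be sketched with a small Python model. This is only an illustration, not Imagine kernel code: the function name `run_simd`, the per-triangle pixel counts, and the assumption that each cluster produces one pixel per loop iteration are all made up for the example.

```python
from collections import deque

NUM_CLUSTERS = 8  # from the text: 8 clusters share one microcontroller


def run_simd(pixel_counts, num_clusters=NUM_CLUSTERS):
    """Return how many loop iterations the model needs.

    pixel_counts is a hypothetical list giving the number of pixels in
    each input triangle. All clusters step together: in each iteration a
    cluster first takes a new triangle if its current one is finished,
    then emits one pixel (assumed: one pixel per iteration).
    """
    stream = deque(pixel_counts)
    remaining = [0] * num_clusters  # pixels left on each cluster's triangle
    iterations = 0
    while True:
        for c in range(num_clusters):
            # Finished clusters fetch the next triangle, skipping empty ones.
            while remaining[c] == 0 and stream:
                remaining[c] = stream.popleft()
        if not any(remaining):
            break  # input stream drained and every cluster is idle
        # One shared instruction stream: every busy cluster emits a pixel.
        remaining = [r - 1 if r > 0 else 0 for r in remaining]
        iterations += 1
    return iterations
```

In this model, 8 triangles of 3 pixels each finish in 3 iterations, while a ninth triangle of the same size adds 3 more iterations during which only one cluster is busy.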

Now, what's in a cluster?

Each of the 8 clusters has 3 adders (which do integer and floating-point adds, subtracts, and logical ops); 2 multipliers (integer and floating-point); and a divide/square root unit (integer and floating-point divide). There's also a scratchpad register file (256 entries, used for local array lookups) and a communication unit used for sending values to and receiving values from other clusters. Except for the divide/square root unit, each of these functional units is fully pipelined and can issue one operation per cycle.
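
As a rough tally of the arithmetic resources just listed, the sketch below counts the fully pipelined arithmetic units per cluster and across the machine. The `UnitKind` type and the function names are hypothetical, and the scratchpad and communication unit are left out on the assumption that only adders and multipliers count as arithmetic issue slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitKind:
    name: str
    count: int
    fully_pipelined: bool


# Arithmetic units per cluster, as listed in the text.
CLUSTER_UNITS = (
    UnitKind("adder", 3, True),
    UnitKind("multiplier", 2, True),
    UnitKind("divide/square root", 1, False),
)
NUM_CLUSTERS = 8


def pipelined_issue_slots(units=CLUSTER_UNITS):
    """Arithmetic ops one cluster can start every cycle."""
    return sum(u.count for u in units if u.fully_pipelined)


def peak_ops_per_cycle(num_clusters=NUM_CLUSTERS, units=CLUSTER_UNITS):
    """Arithmetic ops all clusters together can start every cycle."""
    return num_clusters * pipelined_issue_slots(units)
```

Under these assumptions each cluster can start 5 arithmetic ops per cycle, and the 8 clusters together can start 40.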

Clusters are VLIW controlled with one important difference: instead of one giant register file feeding all functional units, each functional unit input is fed by its own private register file (usually 16 entries per RF). In this way we avoid the huge area, power, and delay costs associated with large multiported register files.

In exchange, the outputs of functional units must be routed through the cluster switch. Scheduling this switch is not something that should be done by hand, so writing straight machine code for complex kernels is not tractable. Instead, we write kernels in C++ and let a kernel scheduler schedule the functional operations and the associated communications.

So what's this all mean?

First, don't worry about 8 clusters. Just worry about one of them at a time, and know that there are 7 other clusters doing the same thing to other triangles.

However, the loop code has to handle every possible execution path. For example, you have to account for the case where you continue to work on the same triangle because it isn't finished, as well as the case where you are finished with the current triangle and must fetch another one.

You have a lot of ALUs to throw at the problem. I would guess that determining one pixel takes on the order of 50 ops. Since you can issue 5 arithmetic ops per cycle (on the 3 adders and 2 multipliers), a well-structured app ought to be able to produce a new pixel every 10 cycles. Because the current algorithm is not very good, the loop length is 53 cycles. In 53 cycles we could actually do over 250 arithmetic ops (53 × 5 = 265), which is far more than the problem requires. The problem is not a lack of computational resources, but a poorly structured algorithm with a long critical path and interfering loop-carried state.
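
The arithmetic behind these numbers can be checked with a few lines of Python. The constants come from the text (the 50 ops per pixel is only a guess there, not a measurement). The function names are hypothetical, and the utilization figure assumes the 53-cycle loop produces one pixel per iteration.

```python
import math

OPS_PER_PIXEL = 50  # rough guess from the text, not a measurement
ISSUE_WIDTH = 5     # arithmetic ops issued per cycle in one cluster
LOOP_CYCLES = 53    # loop length of the current algorithm


def ideal_cycles_per_pixel(ops=OPS_PER_PIXEL, width=ISSUE_WIDTH):
    """Cycles per pixel if every issue slot did useful work."""
    return math.ceil(ops / width)


def loop_capacity(cycles=LOOP_CYCLES, width=ISSUE_WIDTH):
    """Arithmetic ops that could be issued during one loop iteration."""
    return cycles * width


def utilization(ops=OPS_PER_PIXEL, cycles=LOOP_CYCLES, width=ISSUE_WIDTH):
    """Fraction of issue slots used, assuming one pixel per iteration."""
    return ops / loop_capacity(cycles, width)
```

With these inputs the ideal is 10 cycles per pixel, the loop has room for 265 ops, and under the one-pixel-per-iteration assumption fewer than one in five issue slots would be doing useful work.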