The current implementation

English description

The algorithm takes a stream of triangles as input and considers a single triangle at a time. On each loop through the algorithm, it outputs a pixel covered by the current triangle. If it finishes with that triangle, it fetches the next one from the triangle stream. It continues until the triangle stream is exhausted.

Now let's look at what happens in each loop. We are currently using a scanline algorithm. At all times, we know the current triangle and the y-coordinate of the current scanline.

On each loop we decide if we need a new triangle or not, and bring in a new one if necessary. We decide if we need to move to the next scanline or not and increment the current scanline if so. Finally, we output the current pixel.

What this means is that we're actually evaluating three different calculations each time through the loop:

Prepped triangle

We're allowed to do a lot of work per triangle to prep it in the way we want it. Our prepped triangles have the following:

Pseudocode

loop until done {
  if we need a new triangle, bring next triangle into new_tri
  tri = need_new_triangle ? new_tri : tri;
  ycurrent = need_new_triangle ? y0 : ycurrent;

  e0span = distance between spanning edge and edge 0 on current scanline;
  e1span = distance between spanning edge and edge 1 on current scanline;
  pick whichever one is smaller;

  // we could do this another way: keep track of only one edge (e0),
  // and then switch to the next edge (e1) when we reach v2. However,
  // that way takes less math and more control decisions. We would
  // rather eliminate control decisions and do more math - that's more
  // suited for this processor.

  // all spans go from the spanning edge to the other edge
  // to handle shadow edges properly is no easy task since spans can
  // go either left or right
  determine the start and end for the current span;
  set xstart, xcurrent, xend depending on if we need a new span or not;
  clip to screen width if necessary;
  
  (x, y) of current pixel is (xcurrent, ycurrent);
  current pixel is valid if (ycurrent <= y1) && (xcurrent <= xend)
  set need_new_span = (xcurrent > xend);
  set ycurrent = need_new_span ? ycurrent + 1 : ycurrent;
  set need_new_triangle = (ycurrent > y1);
  set need_new_span = need_new_span | need_new_triangle;
}

Real code

xyrast_kc.cpp

Why is this slow?

This algorithm is slow primarily because of the long string of control flow. Note that at the start of the loop we need to know if we are bringing in a new triangle. But we don't calculate that until the very end of the previous loop. This means that we cannot software-pipeline efficiently, and software pipelining is really the key to being able to get good performance on Imagine.