Current implementation

English description

The algorithm takes a stream of triangles as input and considers a single triangle at a time. On each iteration of its main loop, it outputs one pixel covered by the current triangle. When it finishes with that triangle, it fetches the next one from the triangle stream. It continues until the triangle stream is exhausted.

Now let's look at what happens in each loop. We are currently using a scanline algorithm. At all times, we know the current triangle and the y-coordinate of the current scanline.

On each loop we decide if we need a new triangle or not, and bring in a new one if necessary. We decide if we need to move to the next scanline or not and increment the current scanline if so. Finally, we output the current pixel.

What this means is that we're actually evaluating three different calculations each time through the loop:

Output the first pixel of the first scanline of a new triangle.
Output the first pixel of the next scanline of the current triangle.
Output the next pixel on the current scanline of the current triangle.

Prepped triangle

We're allowed to do a lot of work per triangle to prep it in the way we want it. Our prepped triangles have the following:

We start by sorting the triangle's vertices by y. The three vertices are v0, v1, and v2. v0 is the lowest vertex (smallest y), v1 is the highest (largest y), and v2 is the one in between.
Each edge is then described by an (x, delta) pair. x is x(y0) for that edge, that is, the edge's x-coordinate at scanline y0. delta is dx/dy. This allows incremental calculation to move from one scanline to the next. All of this calculation is floating-point.
In the prepped triangle, we also keep integer values for y0, y1, and y2.
The spanning edge is the edge between v0 and v1. We keep a bit saying whether the spanning edge is on the left or the right hand side of the triangle, as those two cases are handled slightly differently.
The edge between v0 and v2 is e0 (it's the lower edge), and between v2 and v1 is e1 (the upper edge).
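
As a rough illustration of the prepped triangle described above, here is a minimal C++ sketch. The names Vertex, Edge, PreppedTriangle, makeEdge, and prepTriangle are hypothetical and are not taken from xyrast_kc.cpp. The sketch assumes that every edge's x is evaluated at the triangle's first scanline y0 (so e1 is extrapolated down to y0), that integer scanlines are obtained by rounding up, and that no edge is horizontal.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

struct Vertex { float x, y; };

// An edge in incremental form: x on scanline y0, plus dx/dy.
struct Edge { float x, delta; };

struct PreppedTriangle {
    Edge span;        // spanning edge, v0 -> v1
    Edge e0;          // lower edge,    v0 -> v2
    Edge e1;          // upper edge,    v2 -> v1
    int y0, y1, y2;   // integer scanlines of the sorted vertices
    bool spanOnLeft;  // is the spanning edge the left side of the triangle?
};

// Assumption: from.y != to.y (horizontal edges are not handled in this sketch).
static Edge makeEdge(Vertex from, Vertex to, float yStart) {
    const float delta = (to.x - from.x) / (to.y - from.y);
    return Edge{from.x + delta * (yStart - from.y), delta};
}

PreppedTriangle prepTriangle(Vertex a, Vertex b, Vertex c) {
    std::array<Vertex, 3> v{{a, b, c}};
    std::sort(v.begin(), v.end(),
              [](const Vertex& p, const Vertex& q) { return p.y < q.y; });
    // Naming from the text: v0 has the smallest y, v1 the largest, v2 is between.
    const Vertex v0 = v[0], v2 = v[1], v1 = v[2];

    PreppedTriangle t{};
    t.y0 = static_cast<int>(std::ceil(v0.y));  // rounding rule is an assumption
    t.y1 = static_cast<int>(std::ceil(v1.y));
    t.y2 = static_cast<int>(std::ceil(v2.y));

    const float yStart = static_cast<float>(t.y0);
    t.span = makeEdge(v0, v1, yStart);
    t.e0   = makeEdge(v0, v2, yStart);
    t.e1   = makeEdge(v2, v1, yStart);

    // The spanning edge is on the left if, at v2's height, it lies left of v2.
    const float spanXAtV2 = v0.x + t.span.delta * (v2.y - v0.y);
    t.spanOnLeft = spanXAtV2 < v2.x;
    return t;
}
```

With this layout, moving an edge from one scanline to the next is a single addition (x += delta), which is the incremental calculation mentioned above.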

Pseudocode

loop until done {
  if we need a new triangle, bring next triangle into new_tri
  tri = need_new_triangle ? new_tri : tri;
  ycurrent = need_new_triangle ? y0 : ycurrent;

  e0span = distance between spanning edge and edge 0 on current scanline;
  e1span = distance between spanning edge and edge 1 on current scanline;
  pick whichever one is smaller;

  // we could do this another way: keep track of only one edge (e0),
  // and then switch to the next edge (e1) when we reach v2. However,
  // that way takes less math and more control decisions. We would
  // rather eliminate control decisions and do more math - that's more
  // suited for this processor.

  // all spans go from the spanning edge to the other edge;
  // handling shadow edges properly is not easy, since spans can
  // go either left or right
  determine the start and end for the current span;
  set xstart, xcurrent, xend depending on whether we need a new span
  (a new span begins at xstart; otherwise xcurrent advances by one);
  clip to screen width if necessary;
  
  (x, y) of current pixel is (xcurrent, ycurrent);
  current pixel is valid if (ycurrent <= y1) && (xcurrent <= xend)
  set need_new_span = (xcurrent > xend);
  set ycurrent = need_new_span ? ycurrent + 1 : ycurrent;
  set need_new_triangle = (ycurrent > y1);
  set need_new_span = need_new_span | need_new_triangle;
}
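
To make the pseudocode concrete, here is a hedged C++ sketch of the same loop running on an ordinary CPU. It is not the code in xyrast_kc.cpp: the names Tri, Pixel, and rasterize are hypothetical, a std::vector stands in for the triangle and pixel streams, every edge's x is assumed to be given at scanline y0, spans are normalized to run left to right, and a pixel at integer x is assumed to be covered when left <= x < right. The state updates use selects (?:) as in the pseudocode; the two remaining if statements stand in for conditional stream input and output.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Edge  { float x, delta; };  // x at scanline y0, and dx/dy
struct Tri   { Edge span, e0, e1; int y0, y1; bool spanOnLeft; };
struct Pixel { int x, y; };

std::vector<Pixel> rasterize(const std::vector<Tri>& tris, int screenWidth) {
    std::vector<Pixel> out;
    std::size_t next = 0;
    Tri tri{};
    int ycurrent = 0, xcurrent = 0;
    bool needTri = true, needSpan = true;

    while (true) {
        // Bring in a new triangle if the previous iteration asked for one.
        if (needTri) {
            if (next == tris.size()) break;  // triangle stream exhausted
            tri = tris[next++];
        }
        ycurrent = needTri ? tri.y0 : ycurrent;

        // Evaluate all three edges on the current scanline.
        const float dy = static_cast<float>(ycurrent - tri.y0);
        const float xs = tri.span.x + tri.span.delta * dy;
        const float x0 = tri.e0.x + tri.e0.delta * dy;
        const float x1 = tri.e1.x + tri.e1.delta * dy;

        // Distance from the spanning edge to e0 and to e1; keep the smaller.
        const float e0span = tri.spanOnLeft ? x0 - xs : xs - x0;
        const float e1span = tri.spanOnLeft ? x1 - xs : xs - x1;
        const float d = std::min(e0span, e1span);

        // Span endpoints, clipped to the screen width.
        const float left  = tri.spanOnLeft ? xs : xs - d;
        const float right = tri.spanOnLeft ? xs + d : xs;
        const int xstart = std::max(static_cast<int>(std::ceil(left)), 0);
        const int xend =
            std::min(static_cast<int>(std::ceil(right)) - 1, screenWidth - 1);
        xcurrent = needSpan ? xstart : xcurrent + 1;

        // Emit the pixel only if it is valid; otherwise the iteration is wasted.
        const bool valid = (ycurrent <= tri.y1) && (xcurrent <= xend);
        if (valid) out.push_back(Pixel{xcurrent, ycurrent});

        // Control state for the next iteration, computed last.
        needSpan = xcurrent > xend;
        ycurrent = needSpan ? ycurrent + 1 : ycurrent;
        needTri  = ycurrent > tri.y1;
        needSpan = needSpan || needTri;
    }
    return out;
}
```

Note that needTri is only known at the bottom of the loop body but is consumed at the very top of the next iteration, and that each span costs one extra iteration that produces no valid pixel.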

Real code

xyrast_kc.cpp

Why is this slow?

This algorithm is slow primarily because of its long chain of control dependencies. At the start of each iteration we need to know whether we are bringing in a new triangle, but we don't calculate that until the very end of the previous iteration. Because of this loop-carried dependence we cannot software-pipeline efficiently, and software pipelining is really the key to getting good performance on Imagine.