Rasterization Problem

Rasterization problem: Enumerate the grid points covered by a 2D screen-space triangle

Executive summary

I'm asking for your help in solving the following problem. On an architecture which does not allow control-flow branches (if-then), find an efficient algorithm to take a prepped screen-space triangle as an input and, on each loop through the algorithm, output a screen-space pixel location covered by that triangle. The algorithm should run as fast as possible and, while it can output invalid pixels and mark them invalid, it should minimize the number of invalid pixels output.

Interested? Keep reading!

Introduction
The hardware model
The loop
Loop metrics
Loop restrictions
Good things
The current implementation
Other stuff I've tried
Contact me

Frequently Asked Questions

Greetings! My name is John Owens and I have a graphics algorithm problem. This web page is a little long, but I hope it fully describes the problem. The problem is pretty simple. I would like an efficient routine which takes a screen-space triangle, with vertices at arbitrary floating-point values, as input and emits all the integer screen-space grid points covered by that triangle. Here's an example:

The points covered by this triangle are:
(3, 2); (4, 2); (2, 3); (3, 3); (4, 3); (3, 4); (4, 4); (3, 5).
(I'm using the shadow rule that points directly on a pixel which lie on the right side of a scanline are drawn, but ones on the left side are not.)

This procedure is exactly what rasterizers do. However, my problem has some important restrictions that make it difficult (and more interesting!), which I will enumerate below.

In my graphics pipeline, it's the largest single chunk of runtime, and its poor performance in identifying valid pixels leads to further inefficiencies further on in the pipeline. So I am interested in getting a better solution, which is why I'm asking your help.

The hardware model

This algorithm will be implemented on the Imagine Stream Processor. Imagine is a SIMD machine meaning that it processes several input elements (in this case, triangles) at the same time using the same instruction stream. (On Imagine, there are 8 arithmetic clusters, each with several functional units, and each input element will be processed on one of the clusters.) This leads to the following restrictions:

No if statements. The C trinary operator is OK (a ? b : c). (The reason no if statements are allowed is that the control flow for each triangle must be the same because they all use the same instruction stream. If one cluster branched one way and another branched the other, they would no longer use the same instruction stream.)
No memory references. This shouldn't pose a problem.

This isn't a very complete description of the hardware; I've put together a longer description here.

The loop

I would like to write the following loop. I begin with a long stream of triangles (100 or so). On each iteration of the loop, I must output one pixel which is covered by the current triangle. If I don't have a valid pixel to output, I am allowed to output something bogus and mark it as invalid. At the beginning of the loop, if I need a new triangle, I fetch one. Otherwise, I keep the triangle I'm working on. I loop until I am finished with all the triangles in the input stream. In pseudocode, this loop looks like:

loop over all triangles {
  bring in new triangle if we're done with the last one;
  compute one pixel which is covered by this triangle;
  output that pixel;
  output if that pixel is valid (actually covers a pixel in the triangle);
  decide if we're done with the current triangle;
}

Another way to write this loop is with the conditional input at the end of the loop:

loop over all triangles {
  compute one pixel which is covered by the current triangle;
  output that pixel;
  output if that pixel is valid (actually covers a pixel in the triangle);
  decide if we're done with the current triangle;
  if so, bring in new triangle;
}

I must output a pixel on every loop. The pixel doesn't have to be valid because each pixel is associated with a valid bit. This comes in handy in many cases, for example: I input a triangle which is very small and doesn't overlap any pixels. Because of the way my loop is structured, I have to output at least one pixel per triangle. There aren't any valid pixels for that triangle, so instead I output one pixel and mark it as invalid.

Loop metrics

There are two key metrics which are important in finding good solutions:

A loop should run as quickly as possible. So the computation should be efficient and fast.
A loop should output as few invalid pixels as possible. Invalid pixels just waste computational resources in later stages - I can't throw them away until much later in the pipeline.

It is not too hard to find algorithms which perform well on one metric but not on the other. For example, traversing the bounding box of each triangle would probably generate a very fast loop because the control complexity is small. But most of the pixels generated would be inside the triangle but outside the bounding box, and would thus be invalid.

Loop restrictions

The key restriction is that I must go directly from triangles to pixels in this loop, and we only get to visit each triangle once. Once we begin a triangle, we have to output all its pixels, one pixel per loop, until we are done. Then we can't revisit it. It would be more efficient if I were to input triangles and output spans in one loop, and then input spans and output pixels in a second step. That's what most commercial rasterizers do. I can't do that - I have to do it all in one step. Also, it means that algorithms which split the triangle in half to make half-triangles (which lend themselves to simpler control) are much more difficult to do (possible, but lots of overhead, so not worth it in overall runtime). If you want a more satisfying technical explanation, please contact me.

The restriction that we don't have any branches means that in practice, when we have multiple threads of execution, we have to evaluate all of them, and then at the end, select the right one. For example, to bring in a new triangle:

loop {
  ...
  triangle_stream(need_new_triangle) >> new_triangle;
  triangle = need_new_triangle ? new_triangle : triangle;
  ...
}

Note triangle is held as loop-carried state from one iteration of the loop to the next.

It is important to optimize for small triangles. It is tempting to suggest solutions which work better for large triangles, in particular solutions which impose a significant per-triangle cost. (One example is algorithms which pipeline the carrying of loop-carried state to avoid long loop-carried dependencies. However, this means that we always output two or more pixels per triangle, no matter how small the triangle is. I am resigned to doing one pixel for every triangle, valid or not, but don't want to be required to do more than one.)

Good things

OK, so now things are looking hard. But there are some good properties of this machine.

Lots of functional units. On each cycle of the loop, each cluster in the machine can issue 3 adds/subtracts and 2 multiplies, all either integer or floating point. So there's lots of math to throw at the problem. In general it is better on this machine to do more math and less control flow; making the math more complex in exchange for a simpler control flow is generally a win.
Triangle setup is not an issue in this loop, because we do it earlier in the pipeline. You are allowed to do (almost) unlimited setup. Currently, for example, the setup does one floating-point divide per edge in the triangle (so 3 divides) to set it up for the loop. I am willing to pay a large setup cost for a better performing loop here.
Culling is done in an earlier step. Backfacing triangles are culled. Because so many triangles are small, I also do another cull: triangles which do not cross a scanline (which fall entirely between two scanlines, in either x or y) are also culled.

The current implementation

The current implementation is pretty bad. It makes poor use of the hardware, it is slow, and it does not get good efficiency on valid pixels (about 65-70%).

It essentially uses a scanline method. We keep a triangle as loop-carried state. On each loop we decide if we need a new triangle or not, and bring in a new one if necessary. We also keep track of the current scanline as loop-carried state. We decide if we need to move to the next scanline or not and increment the current scanline if so. Finally, we output the current pixel.

What this means is that we're actually evaluating three different calculations each time through the loop:

Output the first pixel of the first scanline of a new triangle.
Output the first pixel of the next scanline of the current triangle.
Output the next pixel on the current scanline of the current triangle.

Not surprisingly, this isn't efficient. It's pretty control heavy. More information on the current implementation, including source code, can be found here.

Other stuff I've tried

Originally, we looked at traversing a bounding box of the triangle. This has the advantage that the loop is really fast because the control complexity is tiny. However, most pixels that are visited are not valid ones, so this is unacceptable. In a good algorithm, most of the pixels visited should be valid.
I thought about doing a transform on the triangle and coordinate system (below) into a canonical triangle with vertices at (1, 0), (0, 0), and (0, 1).
I thought this would be a better approach because in the normal case, it's difficult to figure out a good traversal over an irregular triangle. I thought maybe I'd try a regular triangle but an irregular grid. Unfortunately, this didn't work out so well because I couldn't figure out a good algorithm to traverse the irregular grid.

Contact me

I'm really interested in getting this problem solved (hence the lengthy writeup). If you have questions, please drop me a line (jowens @ graphics.stanford.edu). Also, if anything is unclear here I will happily clarify; I intend to build a FAQ list if I get good questions.

Thanks!