Rasterization problem: Frequently Asked Questions

  1. If you control both the hardware and the software, why don't they do what you want them to do? Wouldn't it make more sense to investigate how you'd add programmable elements to the proven triangle processor design?
  2. Are you computing eight different triangles at the same time, or are you working on different pixels of the same triangle? If working on different triangles, why? Are the costs of communication between the arithmetic clusters so high?
  3. What are the real costs of this "select" function?
  4. You don't mention any "costs" for accessing external memory.
  5. Just knowing if the pixel is inside the triangle is not enough for rasterization. You also need to compute depth and shading information. Could you work that into the loop? You should be able to hide your latency with this additional computation.
  6. Why not the bounding box? The math to traverse the bounding box is very simple.
  7. Why not quadtree traversal?
  8. Can you work on several triangles simultaneously?
  9. How are cctoi and itocc actually handled in the hardware?

1: If you control both the hardware and the software, why don't they do what you want them to do? Wouldn't it make more sense to investigate how you'd add programmable elements to the proven triangle processor design?

Answer: One of the key areas of our research is exploring what we can do with only programmable hardware. Modern microprocessors are also fully programmable, but they cannot apply the number of functional units, or the data bandwidth, to the problem that we can.

Current graphics hardware is rapidly becoming more programmable. In general, graphics vendors have been adding programmability by beginning with specialized hardware and adding programmable elements. We are taking the opposite approach: begin with a fully programmable machine, and explore from that what specialization might be useful for graphics.

Our processor is targeted at a wide range of media applications, including image, signal, and video processing. We haven't put in any specialization for graphics (or other application domains). There is no question that a hardware rasterizer (for instance) would help performance and would solve this problem, but that's not an option. It's an avenue I intend to explore in my research -- what hardware specialization might make this architecture better suited for graphics -- but right now I am trying to solve the problem with the fully programmable hardware we built.

In any event, we culminated our four-year design process by sending the top-level design to the foundry last week. So there won't be any more hardware changes.


2: Are you computing eight different triangles at the same time, or are you working on different pixels of the same triangle? If working on different triangles, why? Are the costs of communication between the arithmetic clusters so high?

Answer: Each cluster works on a different triangle. One reason for this is that inter-cluster bandwidth is not nearly as large as the bandwidth within a cluster (inter-cluster bandwidth is only one word read and one word written per cluster per cycle). Imagine does not have a high-bandwidth intercluster broadcast mechanism.

A better reason is that I want to make sure small triangles are rasterized efficiently -- I don't want to optimize for large triangles. If 8 clusters all worked on the same triangle, they would output a minimum of 8 pixels for that triangle. Especially for small triangles, many of those pixels might not be valid, and we are trying to make the percentage of valid pixels as large as possible. The UNC PixelPlanes machine did exactly this: it had SIMD hardware, but used it to work on different pixels of the same triangle at the same time. This is efficient for large triangles, where the number of pixels per triangle is considerably greater than the number of SIMD units, but as triangles grow smaller, it becomes less efficient.

A final reason is that we'd like the algorithm to generalize to more clusters -- 16 or 32. As the width of the machine increases and triangles get smaller, it gets harder and harder to get performance gains by working on only one triangle at a time.


3: What are the real costs of this "select" function?

Answer: The select function is actually really cheap. It's a one-cycle function that can be executed on any functional unit (adders, multipliers, divider, or comm unit). So we can issue 7 selects per cluster every cycle. This is handy when selecting large records. For example, the prepped triangle has 8 words in it. We can select between two triangles in 1 cycle + 1 issue slot of a second cycle.

In practice, since the divider isn't used in this algorithm, many of the selects will be executed on it and not use the issue slots of the adders or multipliers.
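As a software sketch of the cost accounting above (this is plain C modeling the hardware behavior, not Imagine kernel code; `select_word` and `select_tri` are illustrative names):

```c
#include <stdint.h>

#define TRI_WORDS 8 /* words in a prepped triangle record */

typedef struct { uint32_t w[TRI_WORDS]; } Tri;

/* Models one hardware select: a single-cycle operation on any
   functional unit. Condition nonzero picks a, else b. */
static uint32_t select_word(uint32_t c, uint32_t a, uint32_t b) {
    return c ? a : b;
}

/* Selecting between two 8-word triangle records takes 8 word-selects.
   With 7 functional units per cluster, 7 issue in the first cycle and
   the 8th takes one issue slot of a second cycle. */
static Tri select_tri(uint32_t c, const Tri *a, const Tri *b) {
    Tri out;
    for (int i = 0; i < TRI_WORDS; i++)
        out.w[i] = select_word(c, a->w[i], b->w[i]);
    return out;
}
```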


4: You don't mention any "costs" for accessing external memory.

Answer: In this operation, you shouldn't have to access external memory. The only information you need to evaluate which pixels are covered is the triangle, which is on the order of 10 words, and any state you care to store during the computation. A cluster has over 200 words of local storage in register files (and another 256 words in the stream register file), so all computation can -- and should -- be local, without using any external memory.


5: Just knowing if the pixel is inside the triangle is not enough for rasterization. You also need to compute depth and shading information. Could you work that into the loop? You should be able to hide your latency with this additional computation.

Answer: This is an excellent question and goes to the heart of how we designed our rasterizer. Short answer: We are attempting to completely separate pixel coverage from interpolation, and are only considering the pixel coverage information.

Originally, we calculated pixel coverage (which pixels are covered by the triangle) and interpolation (interpolating the color, depth, etc. parameters across the triangle) at the same time. Most scanline rasterizers do this: as you traverse a span, for example, you incrementally compute the interpolants, doing one add per pixel per interpolant. This algorithm is efficient in terms of computation.
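The incremental scheme described above can be sketched as follows (a minimal C model of one span; `walk_span` and the parameter names are illustrative, and the per-interpolant start values and x-gradients are assumed to come from triangle setup):

```c
#define NUM_INTERP 3 /* e.g. r, g, b; real shaders may have many more */

/* Walk one span from x0 (inclusive) to x1 (exclusive), producing all
   interpolant values at each pixel. After setup, each interpolant
   costs exactly one add per pixel. */
static void walk_span(const float start[NUM_INTERP],
                      const float dx[NUM_INTERP],
                      int x0, int x1,
                      float out[][NUM_INTERP]) {
    float v[NUM_INTERP];
    for (int j = 0; j < NUM_INTERP; j++)
        v[j] = start[j];
    for (int x = x0; x < x1; x++) {
        for (int j = 0; j < NUM_INTERP; j++) {
            out[x - x0][j] = v[j];
            v[j] += dx[j]; /* one add per pixel per interpolant */
        }
    }
}
```

The catch, as the next paragraph explains, is that the live state (`v` plus the per-span setup) grows linearly with the number of interpolants.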

Unfortunately, as the number of interpolants grew very large (one of our programmable shaded objects, the UNC bowling pin, had 30 interpolants), the amount of local storage needed to handle all the computation was much larger than our hardware could allow. We would like to be able to handle an arbitrary number of interpolants, and wanted to design algorithms which would allow us to do so.

The insight that made this possible was the separation of pixel coverage from interpolation. If we compute all the pixel locations first, we know that's a tractable problem (it's the exact problem I'm posing here). Then, even with a huge number of interpolants, we can process the interpolants in smaller batches that fit on our hardware, with little to no performance loss. Once pixel coverage and the barycentric coordinates at each covered pixel have been calculated, interpolation is a separate -- and efficient -- computation.

For this to work, however, it is necessary to have an efficient pixel coverage algorithm, which is why I posed this question.
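The second pass can be sketched like this (an illustrative C model, not Imagine kernel code; the `Frag` record and `interp_batch` are assumed names, and pass 1 is assumed to have produced the coverage list with barycentric coordinates):

```c
/* Output of pass 1: one record per covered pixel, carrying its
   position and barycentric coordinates. */
typedef struct { int x, y; float b0, b1, b2; } Frag;

/* Pass 2: interpolate n parameters, given their per-vertex values.
   Because the fragments carry barycentrics, a batch of n interpolants
   needs only the fragment list plus 3*n vertex values in local
   storage -- independent of the total interpolant count, which is
   the point of the separation. */
static void interp_batch(const Frag *frags, int nfrags,
                         const float v0[], const float v1[],
                         const float v2[],
                         int n, float *out /* nfrags * n values */) {
    for (int f = 0; f < nfrags; f++)
        for (int j = 0; j < n; j++)
            out[f * n + j] = frags[f].b0 * v0[j]
                           + frags[f].b1 * v1[j]
                           + frags[f].b2 * v2[j];
}
```

Running this once per batch of interpolants (rather than once with all of them live) is what keeps the working set within a cluster's register files.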


6: Why not the bounding box? The math to traverse the bounding box is very simple.

Answer: Agreed on the math. But the percentage of invalid pixels generated is just too large. A triangle can cover at most half of its bounding box, so the best case is only a 50% valid-pixel rate; in the worst case (skinny diagonal triangles), the percentage of valid pixels is very low. Because I have to compute all the pixel coverage information in just this one loop, I can't go back and clean it up later by removing the invalid pixels from the stream. (I would lose the ordering information needed in later steps if I did so.)


7: Why not quadtree traversal?

Answer: [Figure omitted: quadtree traversal of a triangle. Thanks to Lukasz Piwowar for the picture.] A quadtree is efficient for large triangles, but as triangles become smaller (1-5 pixels), the overhead of the traversal is large compared to the amount of work spent per pixel.
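A minimal sketch of that overhead (illustrative C, not our kernel code): count the nodes a quadtree descent visits when guided by the triangle's bounding box. For a pixel-sized triangle in a 128x128 tile, dozens of nodes are touched to emit a single pixel.

```c
typedef struct { double minx, miny, maxx, maxy; } Box;

static int overlaps(const Box *a, const Box *b) {
    return a->minx < b->maxx && b->minx < a->maxx &&
           a->miny < b->maxy && b->miny < a->maxy;
}

/* Descend into any quadrant overlapping the triangle's bounding box;
   pixel-sized cells are leaves. Returns the number of nodes visited
   (the traversal overhead); *leaves counts pixel tests reached. */
static int traverse(Box cell, const Box *tri_bbox, int *leaves) {
    int visits = 1;
    if (!overlaps(&cell, tri_bbox))
        return visits;
    double w = cell.maxx - cell.minx;
    if (w <= 1.0) { (*leaves)++; return visits; }
    double mx = cell.minx + w / 2.0;
    double my = cell.miny + (cell.maxy - cell.miny) / 2.0;
    Box q[4] = {
        { cell.minx, cell.miny, mx,        my        },
        { mx,        cell.miny, cell.maxx, my        },
        { cell.minx, my,        mx,        cell.maxy },
        { mx,        my,        cell.maxx, cell.maxy },
    };
    for (int i = 0; i < 4; i++)
        visits += traverse(q[i], tri_bbox, leaves);
    return visits;
}
```

For a sub-pixel bounding box in the corner of a 128x128 tile, the descent touches 29 nodes (7 interior levels, 4 children each, plus the root) to reach one pixel-sized leaf.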


8: Can you work on several triangles simultaneously?

Answer: Unfortunately, no (although it would really help throughput). The reason is that when I output pixels, I have to output all pixels from a triangle sequentially, without pixels from other triangles interleaved. Also, I need to maintain ordering on the triangles, so I can't, say, chop the input stream of triangles in half and work on the first half and second half simultaneously.


9: How are cctoi and itocc actually handled in the hardware?

Answer: For the purposes of this discussion, condition codes (cc) are stored in a one-bit-wide register file. (They're actually 4 bits wide and can select on a byte basis, but we're not doing that here.) So itocc(i) just takes the low bit of i, and cctoi(c) replicates the bit in c across all 32 bits (cctoi outputs either 0 or 0xffffffff).
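In C, the bit-level behavior described above looks like this (a model of the semantics, assuming 32-bit words; the function names match the operations in the problem, not any hardware RTL):

```c
#include <stdint.h>

/* itocc: keep only the low-order bit of an integer as the
   one-bit condition code. */
static uint32_t itocc(uint32_t i) {
    return i & 1u;
}

/* cctoi: replicate the condition-code bit across all 32 bits,
   yielding either 0x00000000 or 0xffffffff. Negating the bit as a
   signed value and converting back to unsigned does the replication. */
static uint32_t cctoi(uint32_t c) {
    return (uint32_t)(-(int32_t)(c & 1u));
}
```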
Last update: Thursday, Sep 6, 2001 16:34:09 PDT