// FIXME(kayvonf): with finite size push queues or loops, inspect
// bits stay on and we could deadlock
if (grGlobals.sched.stageMaskInspect.Any())
return;
The second circumstance is that the code only produces visualization
information on scheduling changes. A deadlocked application never
changes so no further records are logged as soon as things deadlock.
The only give away is that the cycle count reported by grampsviz stops
increasing and doesn't match the text in the log. Finally, there
appears to be some chance, or at least hard-to-predictability
associated with which inputs deadlock with an inspect bit on and which
don't. Hence the proper detection of deadlock on other images. Now I
just have to figure out what, if anything, to do about it (the only
possibility is to discover / report better errors, so there's limited
upside in spending too long fussing. I do like to fuss, though. And I
dislike pathological failure handling.)
Of course, it actually is sort of good news because the need for a huge log file was a symptom of a deeper flaw. I handed my histogram app a 512x512 image and then went and patiently did other things while it ran. Only it ran and ran. And ran. And ran. And eventually got killed for logging more than two gigs. Sometimes I don't think very deeplly when in scientific method mode, so I responded by turning off logging and assuming the logging was also slowing things down. After rather a while I gave up on that and tried a 256x256 image. After rather a while my brain actually engaged and I took a step back. I timed the 64x64 processing (barely more than a second) and decided that there was no way the slowdown I was seeing at 256x256 (only 16x the data) was remotely reasonable. I looked at the grampsviz output from a run I'd killed prematurely and realized that one of the subqueues was actually filling (presumably the one for zero or 256 since those seem to be the two big spikes in the images I have at hand).
Now, in and of itself, a subqueue filling is a problem, but a problem that should produce deadlock and catch the notice of the deadlock detector. I became suspicious, ratcheted the queue sizes way down (10 slots per subqueue/bucket), and tried the tiny images again. Happily and sadly, the deadlock detector promptly fired. In proper brownian motion debugging, I upped the queue sizes to 100. Still deadlock (160 of the 256 pixels in the 16x16 image are dark enough to quantize to a luma of zero). But, then I tried the 64x64 image and discovered that I have successfully coaxed it into reproducing the bad state the larger images were showing with larger bin sizes. Now I just have to figure out what's going on. I'm sure (or at least, strongly suspect) something fishy in accounting for shader instances that block on push, potentially in conjunction with queue sets or dynamically instanced queue sets. I probably also want a discting indication in grampsviz for shader instances that are blocked, but tying up a thread slot.
Alex asked a good question: What is Map-Reduce bad at? Map-Reduce is here to stay, so it's valuable to delineate both what it handles readily and for what it is ill-suited. Kunle offered matrix multiply as a weakness. I only commented about graphics, since those are the only apps I would say I know well enough. D3D/OpenGL style pipelines are going to look appealing from a high level, but be awful quagmires without some extensions to enforce ordering. Order pervades them at too many fundamental levels and tagging everything with a sequence number, buffering, and reordering at the end is anathema to how they get performance. REYES, on the other hand, seems nicely suited. In all cases, the current implicit split and partition phases in Map-Reduce align with the graphics fixed function / non-data parallel stages and I think it's possible a good implementation would want something more stateful than a simple function callback, but I'm not sure.
In other news, I stumped Amazon. A search for 'paperback bookcase' returns no furniture. Indeed, it returns nothing other than books. Bookcases optimized for the storage of paperbacks seemed like an obvious niche to me (and a wider search online produces some hits), but apparently none of the Amazon vendors agree.
Frustratingly, lots of little nuisances arose in what seemed like a minor task. First, I wanted combine to be a shader since it's logically as lightweight and stateless as any shader instance (the whole appeal of the GPGPU parallel reduction model). However, partial reduction inherently involves consuming two inputs and the current shader instance environment is entirely organized around processing single elements. I double checked and reminded myself that the combine shader sanity test I'd written had cheated and used a buffer for its accumulation of partial results. With a sparse and dynamic key space, that doesn't seem as kosher.
This didn't (doesn't) seem insurmountable, but prompted me to try to write a thread stage version of combine first. The code was straightforward: bind the same queue as input and output, reserve two packets on the input side and one on the output side, combine the two inputs into the output, and commit. Unfortunately, this promptly deadlocked. It's not exactly a real deadlock, but it's close enough to be trouble. The intermediate-pairs queue now has two producers: map and combine. Additionally, combine creates a cycle since it also consumes from intemediate-pairs. The semantic combine really wants is, "read two at a time from intermediate-pairs as long as map is running, then get a short read once there's only one packet left and map is done." That's a little too nuanced for the current deadlock detector, though. Instead, it eventually combine blocks on a two packet input reservation, map is done, and the system deadlocks. I suspect a reasonable general rule is that a stage that produces to a queue it also consumes should still receive NO_MORE if all its upstream producers complete, but I haven't thought it through in detail yet.
Anyhow, these cascading nuisances have prompted me to wonder if the right solution is a formal filter/partial-reduction shader stage type. In conjunction with queue sets, I'm pretty sure I can fake it with instanced thread stages that accumulate their partial results internally rather than constantly returning them to the intermediate-pairs queue, but that precludes creating multiple parallel instances that all handle the same subqueue for e.g., a logarithmic reduction.
Update: Indeed, faking it with a instanced thread stages that keep the partial result on their stack works correctly. No surprise.
My preliminary thinking is that rather than compacting in local store and copying into the queue, there are at least some situations where the push compaction should just be done in-place in each subqueue and the dispatch thread should maintain an array of windows per subqueue (or cache of windows per some large number of subqueues). It absolutely makes sense in the case of reduce, where the downstream isn't going to run until all data is available, so there's no pressure to get data committed early. I think it makes sense more generally whenever pushing to an unordered queue. It may cause fragmentation in the internal queue implementation, but since there's no head-of-line blocking in unordered queues, the increase in packet utilization / density should be the significant one.
There are many morals to this tale, but honestly, the schadenfreude factor is really swamping them all for me. I've just smiled as the various confused, frantic, and resigned emails have gone around. Well, and I tried to help Solomon a little, but quickly established that there wasn't much to be done. For my part, I didn't even notice any hiccups, though I suppose I would have if I'd tried to write this earlier or checkin outstanding changes.
------------------------------------------------------------------------ r1000 | yoel | 2009-04-03 16:40:43 -0700 (Fri, 03 Apr 2009) | 27 lines Make visualization of dynamic instancing actually work. All my prior changes to dump and visualize workloads with dynamic instancing were braindead. In their defense, they didn't segfault and they worked if the graph only had a single queue. That turns out not to be terribly helpful, though. Now I've added enough information to each VizRecord to determine the correct starting offset of each queue in the global list of subqueues and fixed up the surrounding code to honour it. The grampsviz images for histogram make sense and provide value now. ...Oops.
In other high quality facilities news, GCafe stalled for a good five minutes this afternoon. The reconstructed story: the building carpets are being cleaned in the evenings this week. Last night they cleaned the third floor. For highly dubious reasons, the same circuit the janitors use to power their equipment happens to be the one that drives the plasma television and projector in the graphics lab conference room. Shampooing the carpets blew the circuit breaker (which it apparently does routinely), the janitors moved to another plug (also apparently routine), and we were stuck figuring out what had happened and then tracking down the breaker panel in question (over next to the water fountains, nowhere near the conference room). Or rather, John was stuck doing that while we gave up and used a portable projector someone had.
Here's something that has happened several times: a programmer asks me to intervene in some debate he is having with a program manager. "Who is going to write the code?" I asked. "I am..." "OK, who checks things into source control?" "Me, I guess, ..." "So what's the problem, exactly?" I asked. "You have absolute control over the state of each and every bit in the final product. What else do you need? A tiara?" You see, it turns out that this system puts the burden on the program manager to persuade the programmer, because at some point, the program manager runs the risk that the programmer will give up and just do whatever the heck the programmer feels like.I encounter this dysfunction / dissatisfaction with shocking regularity. Programmers, senior well-respected programmers, express feelings of powerlessness in the face of product management, program management, regular vanilla management, etc. I always challenge them back that such powerlessness is only possible if they abdicate their inherent last word on product development. Fundamentally, engineers take their cues and strongest influence from their peers and leads. Those are the people surrounding them all day for advice, discussion, code reviews, mentoring, etc. They may have a weekly one on one with their managers and see him or her in a group meeting or two, but that's an order of magnitude less contact and social influence. Product management, program management, etc. are all even further away. Moreover, the whole reason "the organization" gives engineers fancy titles (tech lead, architect, distinguished engineer, staff engineer, principal engineer, etc.) is to convey both a recognition and trust of their judgement as well as an expectation that they will use it (I'll note that some, but definitely not all of the time, I approach the managers, PMs, etc. about the flip side of such situations, they express an eagerness for the engineers to step up and show more leadership. It's the absence of that leadership that has them knowingly doing an uncomfortable job covering rather than leaving a total vacuum). I think I'll start incorporating, "What else do you need? A tiara?" in my future exhortations to dismayed engineers that they can empower themselves as simply as ceasing to abdicate their influence.
I want to first apologize for my intrusion. My name is ---- and I am a Staffing Specialist with NVIDIA Incorporated (Nasdaq: NVDA). I'm working with Senior Directors of one of NVIDIA's most visible and prestigious teams. We are looking for a CUDA DEVTECH ENGINEER #1136737 ...I actually receive so few unsolicited recruiting emails that I still paid enough attention to judge this a pretty lame effort. There were clearly manual steps to this discovery / screening process that produced my name and the best he can do is devtech? Without even mentioning the other CUDA related openings NVIDIA's web page lists? I guess he only gets paid for hiring devtech, no matter how unlikely the fit. And that's seriously all the cover letter / introduction he can spare? It actually makes the email less appealing than if it were omitted entirely. The NVIDIA marketing youtube links in his signature are a nice touch, though.
*Technically, CPAN has a Parallel-MapReduce, but like many fledgling efforts it manages to be both massively over-engineered / complex and not actually able to express the usage I want. Alas.
Anyhow, after the PPL meeting today, we again tilted at the windmill of parallel taxonomies and required building blocks. I frequently complain that producer-consumer gets overlooked when the parallelism in an application is superficially reduced to what's "task" and what's "data" parallel and that building blocks for the producer-consumer case are a requirement for any lowest level parallelism substrate so just talking about tasks and vectors misses out. Pat justifiably complained that producer-consumer is orthogonal to the abstract academic definitions of task and data parallelism and has prompted me to reevaluate my phrasing once again. Here's what I ultimately wrote:
I suppose I'd rather take a step back from a single task/data dichotomy and say that the taxonomy of significant parallel primitives includes four things:
I've also drawn a few generalizations about how things seem to work out 'in practice':
On a vaguely related note, I think I have finally wrapped my head around what Pat calls a data parallel domain specific language. The longstanding stumbling block has been my under appreciation of the role of the lambda functions / element-wise filters applied with every map, reduce, over, group-by, etc. operation and Pat's insistence on calling it data parallel. I see it as a domain specific language for any parallel program that can be expressed as a static computation graph (or expression tree if you perceive each stage as an operator the way Pat does). It's really the ultimate generalization of all computation graph based systems. Where streaming, at least as it has been adopted, only includes 'map' shaders and GRAMPS supports 'map + push' shaders and threads (the degenerate case of applying a trivial vector operator to the single length vector and putting all the work in the lambda / filter function), this language supports a whole spectrum of distinctly typed stages. I'm curious whether it's significantly more expressible than GRAMPS (or map-reduce, or any of a number of other systems that have consciously reduced the space). If it isn't, then it's really a poster-child domain specific language. The whole thing can be implemented on a more minimal set of primitives while exposing richer higher level primitives upwards. If it is, it would be interesting to know if there's a viable fast implementation of the missing part or whether it inherently doesn't map well even to the bare metal.
The problem is minor. It's unsettling, however, to discover that extensively used and seemingly correct code is wrong. I would have taken a not-implemented in stride, but this manifestation has more of the "I found a compiler bug" feel where I become suspicious of everything and everyone. Luckily there are meds for that.
#define PRODUCEFN(fn) \ static void fn(void); \ void (*_MRProduceFn)(void) = fn; \ void fn(void) #define MAPFN(fn, input) \ static void fn(const void *input); \ void (*_MRMapFn)(const void *input) = fn; \ void fn(const void *input) #define COMBINEFN(fn, key, v1, v2) \ static void fn(const void *key, const void *v1, const void *v2); \ void (*_MRCombineFn)(const void *key, const void *v1, const void *v2) = fn; \ void fn(const void *key, const void *v1, const void *v2) #define REDUCEFN(fn, key, values, num) \ static void fn(const void *key, const void *values, int num); \ void (*_MRReduceFn)(const void *key, const void *values, int num) = fn; \ void fn(const void *key, const void *values, int num)Hopefully this means I'll think up a clever way to do it on the way home tonight. Otherwise I'll have to actually clean it up and live with it. And that doesn't count the weak symbol definitions I threw into the bottom of the header file to make it work out when an app doesn't define all the entry points.
Ian had some good comments when we talked more after dinner and I've even post facto modified my slides a little in response. Mendel asked me if GRAMPS supported task parallelism and I realized that it fits really well. Task parallelism is just an execution graph where the what's queued is activation records for sub-tasks rather than actual data. Similarly, data parallelism is just an execution graph where all that flows through each queue is a single NO_MORE/ALL_DONE as kernels complete. Pat liked my characterization of task-parallel as Divide and data-parallel as Conquer (which, by the way, I claim is truly accurate, not just a snappy rhetorical device).
I think we explicitly agreed that from the CUDA/source code perspective there was no way to tell the difference whether the compiler or hardware implemented the vector masks to deliver a scalar environment, but he also emphatically stated (1) that the two were fundamentally different and (2) they were disinclined to tell us why the two were different or offer any insights into choosing one design over the other. He stated that any further how or why details got into NVIDIA 'secret sauce' and it made no business sense to reveal anything further since it would help their competitors. I found that very surprising-- how do you persuade people the merits of an architecture without complementing 'what' with 'why'? Well, perhaps not pure end users, but surely a community of designers and researchers examining a variety of choices not only to use, but to which to contribute, for which to build, etc.
My experience and instinct argue there should be a qualitatively meaningful middle ground for providing useful high level information without giving away competitive advantage. Is there truly a trade secret that is so simple, so central, and yet so unpatentable that protecting it prevents offering any insights? I wouldn't expected that and it would make it very awkward for satisfying my honest and good faith intellectual curiosity about design choices and the corellary lessons to be learned.
(A note: SIMT does offer the potential for an implementation that can be discerned from threaded vector. There is no inherent limitation that only one program counter within a warp execute at once. At Kurt pointed out last year, one can easily conceive of a SIMT implementation that can dispatch multiple different instructions per cycle in the event of a divergent warp. It remains an interesting question whether such an implementation immediately trends towards as many independent instructions as threads in a warp or whether a warp still reflects an interesting execution coherence grouping.)
11 November 2008:
Now, obviously one could implement exactly the same thing with a parallel for() loop (use the loop index as the thread ID and paste in the same kernel body). I also think the resource footprint and practical communication constraints imposed by current GPU shader core architectures are a good match for the expectations embodied by the conversational shorthand, "basic parallel for() loop". However, I understand how an argument that CUDA is more than that can be defended on marginally technical grounds by a particularly dogmatic partisan.
Note that despite that, I think CUDA as a programming model has actually demonstrated itself a valuable contribution to commodity parallel programming. Despite my teasing and longstanding dismissiveness, I sincerely hope Intel implements a "scalarizing shader/kernel compiler" that runs CUDA-like kernels on Larrabee and NVIDIA, Intel, or someone makes a run-time available for out-of-order x86 cores, too. For the class of apps that it admits, it offers a relatively simple abstraction to developers that admits highly parallel and efficient implementations on widely available hardware. Deep conceptual work or not, that sounds significant to me.
Upon consideration, this seems to reflect two different divergences from my world view: operation on in-place/persistent data and the ephemerality of tasks. The former is easily shown with an example from cloth simulation. Such simulations (or at least, many such simulations) are based on a preset point-spring system. The first kernel of each time step visits all the points, performs local force computations, and computes each point's proposed new position and velocity. In that sense, it's like a vertex or fragment shader. However, unlike transformed vertices or shaded fragments, the grid of points persists statically across every time step. Transformed vertices are queued and discarded after rasterization, but "transformed" cloth points are the input in the next pass. Additionally, they're on a preset, pre-allocated grid that is not even repacked, let alone variable length. Thus, it makes vastly more sense to assume that a task, or set of tasks, will operate in-place on the grid rather than consume it and queue output downstream. In contrast, when a kernel (or set of tasks) generates a variable number of outputs or transient intermediate results dynamic queueing handles allocation and scheduling the producing and consuming tasks co-residently saves huge amount of bandwidth and storage from spilling all the intermediate data and recirculating it. Given the longstanding emphasis on iterative simulations and solvers and the relative inexpense spilling to memory in past networked clusters compared to current chip multi-processors, I can see how co-resident producer-consumer parallelism has been overlooked.
The other divergence I see is that other parallel systems-- task or data-parallel-- assume tasks are ephemeral. Specifically, they assume that preemption is never necessary: with only minor/short-lived disturbances, load balancing can be achieved by waiting for tasks to finish and appropriately placing new ones. This has a major consequence: no task-local statefulness. That's an abstract mouthful that deserves an example. Consider a kernel, task, or set of collaborating tasks that want to perform an operation that spans multiple input elements: e.g., computing the median of all inputs, sorting fragments in depth order and resolving transparency, or even just caching the last N inputs or intermediate results for some purpose. There are two choices straightforward choices for scheduling such functions: defer launching them until the system is certain all of their input is ready or launch as soon as the first input arrives, but deschedule/preempt when more input is temporarily unavailable. Only the former is compatible with ephemeral tasks, but requires potentially huge bandwidth and storage to spill and recirculate (much like the variable output problem above).
There's a third choice that advocates of task systems (or users who happen to be stuck with them) suggest: fake the latter by blocking the input into chunks and running a series of sub-tasks that compute a partial reduction and spill enough internal state for the next task to pick up and continue. There are no shortage of reasons to complain about this "solution", mostly the standard event/continuation warts: sub-task management and bookkeeping ends up dumped on the application writer, the internal/continuation state needs to be synchronized with some ad hoc locks, barriers, etc. I prefer a simpler objection, though: preemption of threads is not a difficult problem and is one long since solved (it is even routinely built into processor cores these days). Why inflict ugly program transformations on the programmer to fake a semantic that is so easily implemented?
I think both of these "weird" (to me) expectations reflect of a history emphasizing networked clusters or parallel supercomputers that were rich (at the time) in per-node resources. And, also likely a heritage running offline applications. In both cases, the relative costs of spilling even megabytes of intermediate buffering to seconds (or minutes or hours) or kernel computation time are acceptable. At real-time and interactive rates, there are just no resources: GPU designers tell me they have been off-chip bandwidth limited for multiple generatiors and they already go to extreme lengths never to spill intermediate data from on-chip FIFOs. I suspect that at real-time rates, the ephemerality assumption for load-balancing can also become questionable. It clearly holds for shader-like tiny thread instances, but I remain skeptical about launching oodles of short-lived tasks rather than few longer lived tasks that dynamically produce and consume on queues. I realize the two models are duals and it is possible the compiler and programming language folks can transform between them effectively, but it seems more likely to map the latter onto actual hardware with reasonable overheads. I am also unsure what the most comprehensible task-dual is to specifying queue/buffer lengths (obviously a limit on task spawn/fan-out, but since tasks cannot just block if they try to spawn while at the limit, it seems potentially ugly).
One concluding note: one can get a lot of mileage out of dependency chains of ephemeral tasks which only communicate at entry and exit. That is, these models evolved for good reasons that still hold. For that matter, CUDA is an even more constrained model-- there is not (yet) an explicit dependency mechanism and the tasks are confined to trivial data-parallel kernels-- and it is widely gaining adoption by people who find it fits pieces of their workload (or can be shoehorned into roughly fitting and the performance plus simplicity trade-offs appeal). I have even found that GRAMPS needs an analog of the dependency/barrier mechanism for the cases where it makes sense to operate in place rather or a reduction needs to know when it has seen all relevant input. However, I am surprised by the apparent blind-spot to recognizing the locality advantages of producer-consumer parallelism between co-resident threads. It seems like a straightforward extension to task parallelism whereas trying to infer it automatically and reconstruct it via task-fusion and program transformation seems very difficult.
Any programming model brings generality (if it didn't, it would be an application, not a programming model). And, generality is directly opposed to the programming mode/run-time/hardware being able automatically to maximize optimization. Instead, I would claim that any programming model that has been successful/popular has included explicit mechanisms for its power users (i.e., savvy developers) to tune, guide, and/or optimize. Even programming models that emphasize ease over performance-- e.g., many scripting language and probably all managed code environments-- develop extensive practices and features for enhancing performance. These show up consistently as the same thing: guides for the progamming model to exploit its own performance features better. Map-Reduce implementations tend to let power users control the number of map and reduce tasks and the partitioning in case the best effort black box strategy for load balancing can be improved. The many-core versions expose a knob for appropriate batch sizes so that users can tune based on the (non-input) working set sizes of their map functions. Similarly, GRAMPS lets users specific maximum queue sizes as explicit load balancing hints to the scheduler. CUDA makes explicit the size of thread group launches because it's a correctness requirement for some use cases of the shared cache, but it's also an optimization parameter for balancing number of groups versus group size in many others.
A completely separate observation: GRAMPS is not alone in leaving it a separate problem to write good kernels. All of the parallel programming models about which I'm reading are focused on structuring the skeleton the application for performance and abdicating a discussion or ownership of optimizing kernels (whether they're called kernels, shaders, mapper/reducers, tasks, etc.). This is also entirely natural: when the focus is on obtaining speedup through parallelism, obtaining speedup in the various sequential instances is an orthogonal problem. The sequential bits are also common too and richly mined by the existing work in the context of serial execution. Natural or not, it is nice to see it confirmed de facto (none state it outright) by other work in the area.
Update (4 Dec 2008): We're received confirmation that the Industrial Practice track dissolved and our submission is a regular part of the conference. Poof.
The situation's a little different for the GPU-like configuration. Its scheduler is already synchronous. That amounts to queuing less work and isn't very susceptible to the head of line blocking scenario above. That said, it develops on pipeline bubble I can't currently explain with confidence. Here is the queue and core occupancy for a 1024x1024 rendering of the courtyard scene. I don't fully understand the bubble that develops about 60% of the through in conjunction with the late spike in RO work. That's downstream of the critical weirdness though. There are occasional short bursts of vertex work surrounded by fragment and blend work such as the one where the cursor is in the image. Given the priorities, that vertex work should have been scheduled as soon as it appeared, but I can't explain why it just appeared then. Even if it were backlogged on a pending scoreboard, each thread has run both blend and fragment work and those should have been pre-empted. Order is always the whipping boy for weird things, but I haven't pieced together the full narrative yet.
It's worth noting that the lazy scheduler has exactly footprint trade-off against the eager scheduler that one would predict. Here is the same occupancy picture for the eager scheduler. It goes through far more vertex - fragment - blend cycles and has the pipeline bubbles from exhausting all fragment work, but not being able to fill the machine with vertex-work alone. The thing is, lazy does not have significantly better thread slot occupancy and (unsurprisingly) it has a much higher queue footprint. The lazy scheduler gets 95.3% occupancy with a 5.5 meg worst case queue footprint. The eager scheduler gets 95.0% -- almost the same -- but has a 1.3 meg footprint. Clearly I need to understand the bubble in the lazy scheduler. It may also be that eager will be unfairly successful so long as scheduling is 'free'.
Interesting. Apparently, at least some versions of elinks can use SpiderMonkey to do limited Javascript. It's too bad elinks is so much clunkier looking than lynx (in fairness, it tries to replicate a full fledged renderer when that's actually counter to my interests).
Apparently, it worked as Pat declared at the end that when I came in as a systems person he had wondered if I'd ever be a graphics person, but that I'd just demonstrated I was legitimately a graphics person. I wonder if this means I should stop telling people who ask about my Ph.D. program that I'm pretending to be a graphics person.
Oh, and printing from Vista sucks.
The G80 with its 128 scalar arithmetic units running at 1.3 GHz should deliver over 160 GFlops, meaning that _tracing one ray costs about 10,000 cycles_ [!! emphasis added]. We suspect the main bottleneck to be the large number of registers in the compiled code, which limits the occupancy of the GPU to less than 33%. Unforunately, although the program requires much less registers, the CUDA compiler is not yet mature enough and cannot aid in reducing their count. An option would be to rewrite the whole CUDA code in PTX intermediate assembly.The unfortunate CUDA compiler is an old familiar whipping boy of mine. I wonder if they've figured out yet that the intermediate assembly doesn't help with register allocation (it's entirely done by the backend compilation to the final form) or if NVIDIA has relented on that stance. I resorted to using the shared cache as excess register file and manually spilling my ray data there while computing intersections. That saved a huge number of registers without limiting parallelism (since I was still slightly register bound, not shared space bound). And of course, I have to wonder what happens when they discover that no amount of parallelism can hide the divergent execution penalty within a SIMD group, warp, or whatever they're called these days.
The step I call "rasterization" has a variety of other names in the REYES and RenderMan literature, but it is entirely rasterization! A collection of surface patches in object space are tranformed to image space and sampled according to pixel (and subpixel) locations. OpenGL and Direct3D call the results fragment while RenderMan calls the results visible points, but they are the same results of the same process. Credit to Kurt for hammering home that not only is this "just like rasterization" but it simply is rasterization. Note that the REYES rasterizer is differently complex than the GL/D3D rasterizer. The REYES rasterizer is simpler because all it ever gets are shading grids (effectively quad strips) and they're very regular. The REYES rasterizer is more complicated because it supports interpolated primitives. I.e. grids that have an associated parameterized location and whose parameter varies among different samples. This is REYES' "stochastic sampling", but in practice it seems to me specifically interpolation rather than a more general scheme. It's also absolutely fundamental to REYES as opposed to a rare corner case that can be marginalized or ignored.
Lpics strikes me as weird because they took this framework and inserted themselves logically in the middle. They split out just the lighting portion of shading and converted it to run in image space instead of object space. Then they turned off transparency so that visible points corresponded one to one with pixel locations. Then they turned off antialiasing so that REYES' surface shading corresponded one to one with a GPU fragment's image space shading. Within this framework, they took their modified subset of shading, tranformed all the inputs to be in pixel space instead of object space, and ran them at the end of the GPU. I agree that it's clever and effective, but it does seem a little contorted. I suppose GPUs of that era basically mandated contorted in order to accomplish anything besides GL/D3D. The unfortunate and unexplained thing is the decision to run the GPU shaders in pixel space instead of object space. That inherently denies them the visibiliy supersampling that's so central to REYES. It doesn't seem like it would have been that hard to literally interpose in the middle instead of just logically. On DirectX 10 era GPUs, I'd actually take things a step further and cut-over the whole bottom of the REYES pipeline over to the GPU...
It sounds like Lightspeed chops things up more like I would envision. However, when Jonathan presented it here last school year, I didn't really have the context to understand enough that my retention is any good so I only have offhand remarks from Kayvon and Jonathan to back that up.
It's an interesting perspective that matched a lot of my analysis while still having some divergence. I definitely believe and agreed that the post-dicing REYES model is roughly equivalent to GL vertex shading and then rasterization without (or with trivial) fragment shading. While I agree that technically the GL evaluators resemble the REYES dicing, I think the fact that they've essentially remaining vestigial pieces of GL ignored by the hardware and applications makes the congruence weak. The REYES rasterizer is also a different creature from the GL rasterizer. The REYES "rasterizer" knows that its primitives are only quads and whose area is only on the order of a single pixel. However, it also has to support "stochastic sampling", which is really just a fancy way of saying it has to additionally sample in time or lens position for motion blur or depth of field. Arguably the most major difference, though, is that the REYES rasterizer delivers order independent transparency. It cannot commit any samples to a pixel until all samples are available and sorted in depth order (or at least all non-opaque samples). The driving motivation for the bucketing scheme in REYES / PRMan was to limit the duration over which samples for a pixel needed to be kept around uncommitted.
Anyhow, after the Gates kick-off party and a ray tracing interlude with Solomon, Jonathan showed up. He's coyly extracting free plane tickets in exchange for enduring a job interview or two (secretly he thrives on the wheeling and dealing). He wanted to hear what we're up to and Kayvon's running off to Seattle for the weekend and Mike's involved in delicate political wrangling over trade press articles on GPU programming (sucker). Since it's been the topic of the week, I ran through assemble threads, shader threads, and the whole fan-in / fan-out / vertex caching / data sharing situation quickly and then sprung "GPU REYES" on him, too.
We agreed that the two interesting questions are: is there a picture REYES can make that GL cannot (and vice versa, but restricting to photo-like pictures) and in areas where the two are technically equivalent, what idioms / abilities are more naturally associated with REYES vs. GL? At the heart of both is an attempt to understand why, from a technical perspective, people want realtime REYES (because some definitely do), and what it is that they actually want. Clearly there's the fantasy of wanting PRMan renders at GPU rates, but just as clearly that's fantasy. Something we think is open and not fantasy is what implementation of REYES / subset of PRMan is practical for realtime / multicore / "GPU-like" implementation if we successfully produce an architecture that generalizes GPUs enough to accomodate other alternatives.
At least I have a reading committee now. If I'd realized I was going to do it without a specific extended thesis proposal, I'd have submitted the form back in February. Ah well, at least my thinking is more organized now.