Interesting. Apparently, at least some versions of elinks can use SpiderMonkey to do limited Javascript. It's too bad elinks is so much clunkier looking than lynx (in fairness, it tries to replicate a full fledged renderer when that's actually counter to my interests).
Apparently, it worked as Pat declared at the end that when I came in as a systems person he had wondered if I'd ever be a graphics person, but that I'd just demonstrated I was legitimately a graphics person. I wonder if this means I should stop telling people who ask about my Ph.D. program that I'm pretending to be a graphics person.
Oh, and printing from Vista sucks.
The G80 with its 128 scalar arithmetic units running at 1.3 GHz should deliver over 160 GFlops, meaning that _tracing one ray costs about 10,000 cycles_ [!! emphasis added]. We suspect the main bottleneck to be the large number of registers in the compiled code, which limits the occupancy of the GPU to less than 33%. Unforunately, although the program requires much less registers, the CUDA compiler is not yet mature enough and cannot aid in reducing their count. An option would be to rewrite the whole CUDA code in PTX intermediate assembly.The unfortunate CUDA compiler is an old familiar whipping boy of mine. I wonder if they've figured out yet that the intermediate assembly doesn't help with register allocation (it's entirely done by the backend compilation to the final form) or if NVIDIA has relented on that stance. I resorted to using the shared cache as excess register file and manually spilling my ray data there while computing intersections. That saved a huge number of registers without limiting parallelism (since I was still slightly register bound, not shared space bound). And of course, I have to wonder what happens when they discover that no amount of parallelism can hide the divergent execution penalty within a SIMD group, warp, or whatever they're called these days.
The step I call "rasterization" has a variety of other names in the REYES and RenderMan literature, but it is entirely rasterization! A collection of surface patches in object space are tranformed to image space and sampled according to pixel (and subpixel) locations. OpenGL and Direct3D call the results fragment while RenderMan calls the results visible points, but they are the same results of the same process. Credit to Kurt for hammering home that not only is this "just like rasterization" but it simply is rasterization. Note that the REYES rasterizer is differently complex than the GL/D3D rasterizer. The REYES rasterizer is simpler because all it ever gets are shading grids (effectively quad strips) and they're very regular. The REYES rasterizer is more complicated because it supports interpolated primitives. I.e. grids that have an associated parameterized location and whose parameter varies among different samples. This is REYES' "stochastic sampling", but in practice it seems to me specifically interpolation rather than a more general scheme. It's also absolutely fundamental to REYES as opposed to a rare corner case that can be marginalized or ignored.
Lpics strikes me as weird because they took this framework and inserted themselves logically in the middle. They split out just the lighting portion of shading and converted it to run in image space instead of object space. Then they turned off transparency so that visible points corresponded one to one with pixel locations. Then they turned off antialiasing so that REYES' surface shading corresponded one to one with a GPU fragment's image space shading. Within this framework, they took their modified subset of shading, tranformed all the inputs to be in pixel space instead of object space, and ran them at the end of the GPU. I agree that it's clever and effective, but it does seem a little contorted. I suppose GPUs of that era basically mandated contorted in order to accomplish anything besides GL/D3D. The unfortunate and unexplained thing is the decision to run the GPU shaders in pixel space instead of object space. That inherently denies them the visibiliy supersampling that's so central to REYES. It doesn't seem like it would have been that hard to literally interpose in the middle instead of just logically. On DirectX 10 era GPUs, I'd actually take things a step further and cut-over the whole bottom of the REYES pipeline over to the GPU...
It sounds like Lightspeed chops things up more like I would envision. However, when Jonathan presented it here last school year, I didn't really have the context to understand enough that my retention is any good so I only have offhand remarks from Kayvon and Jonathan to back that up.
It's an interesting perspective that matched a lot of my analysis while still having some divergence. I definitely believe and agreed that the post-dicing REYES model is roughly equivalent to GL vertex shading and then rasterization without (or with trivial) fragment shading. While I agree that technically the GL evaluators resemble the REYES dicing, I think the fact that they've essentially remaining vestigial pieces of GL ignored by the hardware and applications makes the congruence weak. The REYES rasterizer is also a different creature from the GL rasterizer. The REYES "rasterizer" knows that its primitives are only quads and whose area is only on the order of a single pixel. However, it also has to support "stochastic sampling", which is really just a fancy way of saying it has to additionally sample in time or lens position for motion blur or depth of field. Arguably the most major difference, though, is that the REYES rasterizer delivers order independent transparency. It cannot commit any samples to a pixel until all samples are available and sorted in depth order (or at least all non-opaque samples). The driving motivation for the bucketing scheme in REYES / PRMan was to limit the duration over which samples for a pixel needed to be kept around uncommitted.
Anyhow, after the Gates kick-off party and a ray tracing interlude with Solomon, Jonathan showed up. He's coyly extracting free plane tickets in exchange for enduring a job interview or two (secretly he thrives on the wheeling and dealing). He wanted to hear what we're up to and Kayvon's running off to Seattle for the weekend and Mike's involved in delicate political wrangling over trade press articles on GPU programming (sucker). Since it's been the topic of the week, I ran through assemble threads, shader threads, and the whole fan-in / fan-out / vertex caching / data sharing situation quickly and then sprung "GPU REYES" on him, too.
We agreed that the two interesting questions are: is there a picture REYES can make that GL cannot (and vice versa, but restricting to photo-like pictures) and in areas where the two are technically equivalent, what idioms / abilities are more naturally associated with REYES vs. GL? At the heart of both is an attempt to understand why, from a technical perspective, people want realtime REYES (because some definitely do), and what it is that they actually want. Clearly there's the fantasy of wanting PRMan renders at GPU rates, but just as clearly that's fantasy. Something we think is open and not fantasy is what implementation of REYES / subset of PRMan is practical for realtime / multicore / "GPU-like" implementation if we successfully produce an architecture that generalizes GPUs enough to accomodate other alternatives.
At least I have a reading committee now. If I'd realized I was going to do it without a specific extended thesis proposal, I'd have submitted the form back in February. Ah well, at least my thinking is more organized now.
Anyhow, I've decided that key points to express (and there are so many, all mutual forward references, which is why picking the ones with which to open is such a tangle) include:
Superficially this seems like an unlikely article of faith to add to work queues, but the two are actually very intertwined. Producer / consumer queueing offers a very convenient way for the two styles of cores to transfer work back and forth at a fine granularity while preserving enough context to group and dispatch it efficiently. They are the very mechanism which breaks the batch oriented feed-forward pipeline into a multi-directional state machine and one aspect of that change allows work to move between states serviced on compute cores and on conventional cores.
Commodity operating systems still retain the design principles developed when processor cycles were scarce and RAM was precious. These out-dated principles have led to performance/functionality trade-offs that are no longer needed or required; I have found that, far from impeding performance, features such as safety, consistency and energy-efficiency can often be added while improving performance over existing systems.Too bad the talk itself is at 4:15PM on my birthday. Maybe I'll end up going anyways.
Personally, I think the biggest challenge remaining still is to articulate the framing. I really don't like the instinctive pigeon-holing that we're trying to talk about next generation GPU++ style designs and that everything has to be scrutinized and policed carefully for its negative impact on graphics. I don't care about graphics! That's intentionally a bit of an overstatement, but the fundamental point to me is actually about CPU evolution: the combination of a trend towards compute-dense cores (be they GPU, Niagara, Larrabee, Ageia, or whatever style) that intrinsically offer an order of magnitude compute win over CPUs in the same deisgn budget and the universality of multi-core CPUs means that the "right" future CPU will be heterogeneous-- a number of conventional fat cores (single threaded / ILP performance isn't giving up its mindshare any time soon, whether or not one accepts that it truly matters technically) and a number of compute-maximizing cores.
It's really exactly the same argument as for on-chip FPUs or SSE. There's certainly always other things, even if only a tiny bit of extra caching, that could be eked out with that die area / transistor budget, but they're useful enough for applications that are important enough and the opportunity cost is small enough that they're universal on modern CPUs. I claim the same is on the cusp of being true for FLOP-dense cores and that the ubiquitity advantage of making them part of the nominal notion of CPU and ability to co-opt CPU - compute core coordination, scheduling, and communication into the ISA is huge. That's really basically my claim, along with a bunch of more concrete opportunities to use that architecture to tweak / change the givens of today's CPU plus compute accelerator model to recover a lot of the flexibility, expressiveness, and porting pain researchers currently incur trying to do anything beyond toy programming of compute cores. Even without those improvements though, I think there's a huge insight (though calling it an insight when it seems so obvious to me feels unjustified) in just recognizing that pulling in a compute core as a first class aspect of the concept of CPU is a huge opportunity. And, at the same time, it's really not a radically different hardware change than continuing to bulk up the SSE units, make them more programmable, and devote a hardware thread to them.
He followed the talk with an NDA-concealed much more concrete talk that included a lot of lively debate and discussion. And nearly all the faculty who have offices on the third floor. Unfortunately, from my perspective, we got bogged down in the nitty gritty microarchitectural details of buses, cache organizations, etc. for a long time and the higher level architectural pieces got short shrift and the OS / applications level pieces were extremely rushed. It also makes it challenging to write more concrete my impressions of the public talk without worrying about compartmentalizing the private content.
And nobody paid any attention to hybrid, even though they immediately admitted it was the way to go for making a system like this practical. Grr.
Unfortunately, I've also successfully persuaded him (and more importantly myself) that we need a Shader Model 4.0 interpreter as the first stage in our testbed. It's nice to have something concrete to develop, but in many ways I would rather continue to focus on maturing our ideas and writing before jumping into coding.
Reminder-- our off again, on again relationship with Gates 392 is on again this coming Tuesday. Jeff will tell us something that most likely will make us all question whether our jobs and research are morally fulfilling and Kayvon will feed us something to appropriately celebrate the mutual Texan pride he and Jeff share.
Unfortunately, the paper totally derailed the thesis topic pursuit and I was sluggish about recapturing my mindset. Conveniently, perhaps, Kathi decided to encourage me along by putting my registration on hold without warning at the start of December. That was a bit of a surprise, and turned out to be punishment for TAing with Kayvon thus also getting snagged while Mike and Daniel squeaked through. It effectively got my attention though, and Pat's. We had a series of frustrating conversations around topics I'd articulated and what feels like just an impedence mismatch in the way we approach things and understand words like claims, assumptions, hypothesis, etc. Ultimately we not only managed to understand each other, but he managed to talk me into a much more gradiose vision that I'd laid out and dismissed as too big and, more importantly, too much an argument over vision instead of something that could be demonstrated. It even converged nicely with Kayvon's new interests, so we've gotten the band back together, so to speak. Now I just need time to think and time to work. Hopefully that will be easier once the admissions committee work is done in a month. And we revise the ray tracing paper for its camera-ready deadline in two weeks.
9 June 2006:
Therefore, I've organized my thinking around potential thesis titles and accompanying premises. Pat and I spent a while yesterday discussing (arguing about) our perspectives and I have four to four and a half quite plausible topics (the half isn't quite so fully baked in that it's based on shading work into which I haven't yet dug deeply enough to be confident there's an answer, plus I go back and forth on how much I actually care personally about shading) and two titles that expressly are not worth pursuing (both overlap with my 'good' titles in the critical places and are ill-posed in other aspects).
The good news is four and a half valid thesis topics is plenty. The bad news is that they cover a wide scope and are sufficiently different than focusing on one means putting aside projects I really want to see completed. The other bad news is that, naturally, the most deep / visionary topics are also the ones least liable to actually be useful and involving the most hand-waving and simplifications. By and large, that can be what one expects from an academic thesis, but a piece of me rebels and argues that vision and hand-waving are critical to have framing the choice of projects and topics, but that the projects and topics themselves should stand on their own as a valuable incremental contributions even if the vision never comes to pass. This is tied to my conviction that it's always better to hire someone who can get his hands dirty and ship code than someone full of vision who always hand-waves the implementation details and thus produces projects that never actually deliver or work quite right. I wouldn't trust a bridge builder who pitched an amazing span but admitted he had no idea how to come up with materials that would actually tolerate the induced strain. You can argue that the vision will prompt lesser minds to solve the engineering details, but I don't buy it. I 'invent' an unlimited number of things that just aren't possible and if someone else later comes along and makes them possible, he deserves the credit, not me. I argue the same applies to at least my own thesis or I'll never be satisfied. This is also why I came to believe academia is a bad place for me, though I may be good for it in the same way that eating one's brussel sprouts is good for one.
26 May 2006:
I'd like to say that all of those rumours are massively overblown. They're spun around kernels of reality, but in many ways developing for Cell is just like developing for a chip multiprocessor (or any other SMP). If you don't want to involve the PPE core in your compute kernels (and you certainly don't need to) then there's write-once support code to spin up the 8 SPEs (or however many you want) and launch your apps. You write it once based on either the sample code or the tutorial and never look back. Occasionally, if your workload needs it, you add some very simple message passing for SPEs to signal the PPE when they need to be sent more work and the PPE to respond when more work is sent. Anyone who has ever written a work queue or used a socket for signalling can do this three days dead. It's no different for Cell than via Win32 events or BSD sockets. The APIs just have their own function names. Hooray.
Okay, so now you've gotten your SPE up and running. To be precise, just like starting a thread on any other OS, you've issued a library / system call that took a chunk of code (the ELF binary compiled for the SPE), an entry point (thread main), a void * with initial arguments, and some unimportant optional flags. What about the SPE code? You take gcc and hand it vanilla C code. Or, you take xlc and hand it vanilla C code if you think xlc is more elite than gcc (we don't. Other people around here do. It seems to vary according to personal taste and application). Okay, so it's not quite that easy. You can transparently use stack allocated memory and static / global arrays or objects that are small enough to fit in local store. You cannot transparently malloc huge chunks of memory or dereference pointers to large, system memory backed regions. However, you can mechanically convert every dereference of your big data structures into synchronous DMA and you'll come out with working code. If you're writing from scratch or have anything resembling an accessor function, this is near trivial. The DMA builtins actually take system memory pointers as their argument without any translation or anything. We hit this point with the ray tracer after a few days' messing around. And a major portion of that time was browsing reference material and one-time cobbling together of a Makefile with the correct include paths and what-not to run the cross-compile toolchain.
That's it. Porting in a nutshell (I know, why is it in a nutshell and how do you get it out? Don't ask). But, but, but, you splutter, what about having to restructure your whole application? What about fitting your code + stack + data in 256KB? You don't have to restructure anything. You always can run on the PPE. If you want performance from the SPEs, you will have to multithread at least part of your application into at least 8 threads (or however many SPEs you want to use). However, if you want performance with any chip including current conventional CPUs, you have to multithread the computationally dense portion of your application. Multithreading for Cell isn't intrinsically different from multithreading for a "normal" CPU. So, while I'm happy to grant that having to multithread your code to get performance can be a pain, it's no special barrier unqiue to Cell. Now, as for fitting in 256KB, well, that's a ton of space. I'm clearly a child of the wrong era and it's about to be very clear, but once all your application data is excluded (it lives in system memory, not local store), 256KB is great! With four byte instructions (I have no idea what Cell's instruction width is, but 4 bytes is a fine proxy), splitting LS half and half instruction and data gets you 32K static instructions which is huge for a computational kernel. Moreover, dynamic linking just plain isn't hard and it naturally combines with overlays to allow arbitrarily large code executed in a fixed amount of space. That's too hard for you? Luckily only one guy needs to write it for Cell and we can all use it. However, that's really more for the future. The whole raycasting portion of our code compiles down to 60KB or so. Similarly, the stack size limitation just doesn't seem interesting to me. I've probably spent too long writing kernel code and other specialized code, but if you need more than 4KB, or maybe a whole 16KB, of stack space then you're not writing for performance. And if you don't want performance, get that code back on the PPE where it won't bother us. All told, in a fairly pessimistic scenario, you're left with 128KB or 96KB of space for data. If you're just replacing pointer dereferences with DMA that's tons of space. Actually, there's 2KB of register file (128 x 4-word wide) and that's probably enough.
What's that? Replacing pointer dereferences with DMA is unusuably slow? So it is. If you grant my conservative assessment that there's 96KB of LS available for data then just use it as a simple cache. It takes a day or so to code a simple direct mapped cache (remember, we're discussing how hard it is to get code up and running on Cell. By this point, the naive code was working in the last paragraph and we're just looking for any cheap extra boost) and it's a tiny amount of extra instructions. That 96KB of code is 6x the L1 on a Pentium 4 (and you've got 2KB of registers where the Pentium 4 has to use its L1 to compensate for its tiny handful of registers). So, in exchange for another day of porting (we're almost up to 1 week for 1 grad student who's easily distracted and a little lazy) not only have you ported your app, you've smoothed out the most unreasonable shortcut you used to get the port working.
If your application doesn't benefit from caching then the bad news is implementing the simple cache won't help you on Cell. However, the worse news is that your execution time on normal CPUs is already as bad as the synchronous DMA version on Cell. Or else, you have some a priori knowledge of your algorithm that lets you prefetch or do some other contortions on normal CPUs to get performance up. In that case, there's some good news-- there are no games on the SPE-- you just tell the DMA controller what you want it to prefetch and it goes and does it. No funny "we may or may not honour the hint" prefetch or non-temporal write instructions and no irritating hardware memory hierarchy working to thwart you. Programmer say, Cell do. Seriously, if you are lucky enough to have one of the workloads whose access pattern is structured then Cell is just great. Rather than hinting (or tricking) a CPU into doing the right thing to get bandwidth to memory, you lay it out explicitly for the DMA controller and it happens.
Anyhow, bottom line, we had a working port of the ray tracer in a couple of days and a reasonable starting point to beging analysis and optimization within about a week. The development environment / tool chain is fine (it's not *awesome*, but it's gcc). And the code is pretty much normal multithreaded C code with some funny Cell specific calls instead of pthread or Win32 calls for the scaffolding. So stop the fear mongering. Thanks.
22 May 2006:
16 May 2006:
I ran a pre-checkin round of performance tests (with lazy disabled, but present in the code) and confirmed that the builder didn't get any slower (except for the cbox which is trivial enough that the overhead of the initial lazy root shows up as a fraction of a millisecond) and the ray casting times are actually faster (because of short circuiting on empty leaves). More rigourous numbers when I have more time.
Oh, that's reassuring. For a few minutes I thought I'd made the code slower somehow. Then I realized I wasn't testing 1024x1024 images and 256x256 images don't get as many rays per second (presumably from high packet inefficiency / reduced packet coherence). Whew.
They assumed that no matter what, the programmer has to take the hit of rewriting his monolithic code to be multi-core aware, but then tried to qualitatively assess the programmability of various systems with "general purpose multicore" at the maximum of ease. It certainly has seemed to me like once you suck up the hit of multi-threading your program, porting it to Cell's pretty straightforward. Other than the lack of an i-cache, the constrained LS is not big deal. Our software managed cache for the raytracer is both low overhead and extremely effective. There were also some weird claims about the available GFLOPS of various systems and nothing was consistently normalized either to per-core or per-chip. Ah well.
The most awkward part is that I'm still not sure what concretely they're proposing. Certain minor memory system changes are clear, but they've already demonstrated that they only get about a 20% boost from that, which isn't nearly enough to be competitive. There's some implicit assumption that general purpose CPU vendors will tack on lots of extra FPUs anyhow, so it's just a question of getting those fed. That's certainly not consistent with what I've seen to date (except in special purpose CPUs). And I don't know the cost. Since everything was per-core it's hard to get a sense of how many SPE-like things versus how many symmetric cores I could get on a given chip. Jayanth suggested a factor of 2 or 3 difference which makes me question the worth of the balance. I don't see what I get in exchange for having half or a third as many cores other than a branch predictor and a hardware cache. The branch predictor's not huge (and adding simple branch prediction to an SPE would be a trivial feature) and my software managed cache both has much more control (good for sophisticated programmers) and minimal performance impact (compared to a theoretical perfect full speed hardware cache the ray tracer runs at 94.5% speed). Clock for clock we're right around or over the point where clock for clock each SPE is faster than a Pentium 4 so cutting the number of SPEs by a factor of 2-3 in exchange for Pentium 4 cores is clearly a bad trade for us.
The most baffling part for me was the repeated statement (it wasn't even a claim, everyone seemed to take it as obvious) that Cell is a streaming processor. I don't get it. If a multicore CPU is threaded (or at least not streaming) then Cell is threaded (or at least not streaming). Fundamentally, I can have one SPE decompress my video, another deinterlace it, etc. and I can't do that on any streaming architecture I know. Certainly I can make all the SPEs work together in a streaming fashion, but that's true of any threaded architecture. The models are inclusive in that direction. But not in the other and Cell pretty clearly is threaded in the sense that it has 9 completely isolated and independent threads of execution that are available. Moreover, the memory system (and the DMA controller) allows arbitrary memory access patterns (with sufficient foreknowledge) instead of the a priori access patterns streaming demands. The whole concept of implementing a software managed cache in the heart of the inner loop of the raytracer should make that clear. (Of course, the fact that you can software pipeline among the SPEs should also make that clear). This Cell = streaming meme is very strange.
15 May 2006:
Okay, bizarrely, that turns out to be completely untrue. When I rewrote the code to do the test at the bottom of refineNode() before recurring, all the big scenes got nontrivially slower (tens of milliseconds slower). That doesn't make much sense since there should exactly the same amount of work per node (other than the root) still and near the bottom of the tree, we should save recursion. I guess the code may have gotton a tiny bit more branchy. Still, that's weird.
Bottom line: The infrastructure for the lazy builder doesn't impair the offline builder.
8 May 2006:
I'd sort of settled into 4 bundles per packet (16 ray packets) by default when experimenting with large packets. Because of awkwardness in my sampler, I'm stuck with power of two sized packets, so I can't explore the space too densely without fixing the sampler. Instead, I ran through the robots, kitchen, and bunny with 2, 4, and 8 bundles per packet. For the kitchen and bunny, 4 was definitely the highest point. For the robots, 8 was about 3% faster. When I tried them at 16 though, performance fell back down to the same level as 4. So, I guess I'll stick with 4 bundle packets for now.
Sad. I just went and tried the current version of our brook kd-tree code on the various boards in the lab and it doesn't work at all. The ATI board just goes into an infinite loop and the NV board produces the same sorts of artifacts as the board I have. I'm so glad I don't care very much about GPUs.
However, even once I correct the timing, the rasterization numbers are pretty good. They're pretty much a direct function of the number of triangles in the scene (no surprise since I shove all the primitives in a vertex array and dump them on the board). I've done some timing with and without GL_LIGHTING enabled. Without it, the cbox manages over 1350 fps down to 95 fps for the kitchen. With lighting, cbox only drops to 1310 fps, but the kitchen drops to 64 fps. Those are all at 1024x1024. Messing around, performance is clearly dependent on resolution, but not in any simple way. 512x512 looks to be a bit of a sweet spot for this particularly board.
5 May 2006:
First, build times (release build, single run, no avg, no nothing). Note that 'build time' is the entire time it takes to run the constructor including packing the triangles into primitives, building the tree, and building all the precomputed triangle intersection numbers:
Now, raycasting times for 1024x1024 images, 4 bundle packets:
CPU Shading times are pretty awful, but then that's no surprise. I haven't done careful measurements against even the Brook shading code yet, though.
P.S. If you're curious, my X800XT can rasterize and do direct diffuse lighting at more than 1000 fps (equivalent to casting 1 GRays/s). That's fast enough that I feel obligated to double check the timing code, but it was accurate on the X300.
25 April 2006:
7 March 2006:
Of course, the bad news is that I haven't done any of it. Or seemingly much of anything else. It distinctly feels like the last few weeks have been juggling and treading water. It's already time to synthesize the new things we've read about and the ideas we have in flight and figure out what we're really going to do. I think I should just push forward with a rasterization based system and to heck with trying to fit it nicely into the existing raytracing framework. Oh, and course, ATI still can't release a driver that works. We finally have one that returns a value other than zero for occlusion queries, but the full blown raytracer generates either impressively wrong results or infinite loops depending on configuration. I love the choice of fast but hopelessly broken versus unusably slow. Maybe someday we'll finally get that Cell and be a the mercy of a new vendor's software system (without even a mature API that can supposedly be used).
9 February 2006:
For reference, here's the reduction in packet intersections with perfect mailboxing. Mailboxing has a bigger relative benefit with packets than scalar intersection because the packets visit more a few more nodes. This is with 256x256 scenes.
Given my belief that intersection is roughly half of the total ray casting time, I must be doing something wrong not to be able to capture any speedup on the bunny or robots.
2 February 2006:
It was good to get something down as a starting point (though really that's a distillation of a series of whiteboard "implementations" and discussions Tim and I had) and I've started accumulating additional slides and descriptions to have around for future presentations.
Now I should actually get more organized about writing directed code.
20 January 2006:
19 January 2006:
I do think that focus is beginning to emerge through, or despite, it all. Unfortunately the two outside sources whose feedback I sought are buried under SIGGRAPH submissions. On the upside, hopefully I'll get some interesting papers to read fairly soon. On the downside, no feedback for a while.
Interregnum: In the midst of that last paragraph, local network conditions melted down completely and after an hour or so, I gave up and left. I've finished out my sentence, but hence the abrupt ending. I *heart* NFS.
13 January 2006:
Now, imagine you have an n * m float array on the CPU named buffer[] that you streamRead() into an n x m stream named s<> (created via stream::create(m, n)). You might expect that buffer[ x + y * m ] is the same as s[x][y]. Nope. It is, however, the same as s[float2(x, y)] and the same as s[y][x]. That's right, s[float2(x, y)] == s[y][x] and not s[x][y]. I can't even really blame brook for that because it's really how cg and hlsl operate. I realize there's all sorts of "legacy context" that can be marshalled to defend this. I'm not interested. It remains confusing as hell and each time I go to use 2D gathers or iterators I have to write a little test app to remind myself.
To make matters worse, when using 2D streams as proxies for long 1D streams, the raytracer code sometimes holds the X width fixed at 1024 and varies Y based on total length and sometimes holds Y fixed at 1024 and varies X. Argh. After some quality time with the whiteboard, it seems pretty clear to me that holding X fixed makes the math simpler (regardless of which you hold fixed, the X width determines the 2D <-> conversion so you want it to be a constant). It's also the one that seems more natural to me. Thus, of course, Y is the dimension held fixed for the one set of data beyond my direct control.
12 January 2006:
Over break I came across a few interesting links. Intel posted an article on SSE raytracing back in November that I didn't spot until recently. It goes through some background and nearly the complete code to do traversal and intersection via compiler intrinsics. There's also the latest edition of Ray Tracing News with a bunch of material of varying degrees of interest (to me personally, of course).
And, I finally took the time to go through and read this Master's thesis on GPU raytracing with grids, kd-trees, and BVH's. I've known about it for a while and discussed possible reasons for his BVH so dramatically outperforming a kd-tree with Tim, but hadn't read it. After reading it, I'm a bit disappointed. While at a high level, he implemented kd-trees (or at least the same modified kd-tree algorithm we used), his implementation is suboptimal in lots of ways. Probably the two most important are that his tree nodes are a float4 (2x the size they should be and on a workload he demonstrates is memory bound) and that his implementation is a mixture of multipassing and pixel-shader 3.0 branching I found hard to follow. It certainly didn't sound like there was any z-culling, which is definitely going to vastly inflate the runtime and the branching / looping support in NV4x and G70 is pretty awful as I've described before and as GPUbench results demonstrate. The BVH implementation though, was quite clever.
16 December 2005:
15 December 2005:
As before, these numbers are just the time to walk the rays through the kd-tree (and intersect any primitives), not the time to shade, generate the eye rays, or anything else.
13 December 2005:
One odd thing-- the various bits of documentation encourage enabling flush to zero and denorms are zero mode as allowing faster math. However, it appears that having those enabled also opens me up to this input assist business.
While I was still trying to decipher this, I hacked up the SSE intersection code (and constructor) so that the triangle data was laid out in a more SSE friendly way and I did the reads and shuffles myself instead of hiding them behind _mm_set1_ps(). _mm_set1_ps() actually does a great job if you're reading a single value. When you're reading 12, though, it's better to compress down to 3 128-byte reads and do the shuffles yourself.
A final bit of bad news-- the input assist "impact" (a random scalar vtune produces to allow you to gauge how bad a problem is) gets a bunch worse when I switch from the cbox to the robots. Sigh.
9 December 2005:
I realized that now that I have per-packet intersection, I could turn mailboxing back on, so I added the code and checked the effects. The stat counters show reductions in intersection calculations, but the performance numbers aren't materially affected. Sad, and a little odd.
So, I augmented the stats (and repaired some counters that fell by the wayside when Tim redid the connection between the builder and the accelerator) out of curiosity. There are a lot of empty leaves in our trees. Most of the scenes look to average roughly two triangles per non-empty leaf, but the robots are way higher. Traversal of all the scenes looks to visit primarily empty leaves, which is also interesting.
8 December 2005:
6 Decemember 2005:
2 December 2005:
This is on my 3.4 GHz Pentium 4 with hyperthreading enabled (but only a single-threaded application). The speedup on the cbox and glassner scenes makes me extra suspicious that the SSE-ification of intersection had such an impact for reasons other than raw compute since even in an ideal case it should have maxed out at making only the intersection portion a little less than 4x faster. It'll be interesting to see what impact smoothing the final edges has (prepacking the triangle intersection data better and propagating SSE packed hits end to end). And of course, there's always more optimal intersection routines like the one Ingo (Wald) describes in his thesis.
Also, given how heavily the SSE tracer uses compiler intrinsics that pretty much dictate the resulting assembly, I'm baffled that the debug SSE tracer is so much slower than the release version. With the C tracer, there's roughly a factor of 3 between them. The SSE tracer has a factor around 5. I guess the argument is that once you've truly pared the algorithm down to its essentials, introducing noise (or just debug-friendly assembly) has a larger relative impact. I was still surprised. For reference, in our Makefiles, debug and release mean:
1 December 2005:
For reference, I also benchmarked each of the routines. With release builds, the SSE generation makes 65536 rays in 1.13 msecs and the C version takes 5.18 msecs. The works out to be about 58 million eye rays per second for SSE and 13 million for C. Those numbers remain roughly accurate at 512x512 and 1024x1024 though there's a small falloff at 1024x1024 that I suspect is caused by memory pressure (created elsewhere in the system) as much as anything. They're also stable across scenes (which is good. I'd worry if the CPU could do identical math on certain values appreciably faster or slower than on others). I should note that I haven't heavily optimized either version, though the nature of reformulating the SSE version meant I manually hoisted some computations and precompute some values. So these certainly aren't the best times possible, just a generic data point.
Wow! With SSE eye ray generation and thus more conveniently aligned and packed rays to the SSE traversal, the SSE packet traversal is finally faster than the straight C single ray traversal. And the robots scene is over a million rays per second. Looking at the code, there's some more I can shave by using the SSE representation to quickly detect invalid packets faster (or I could just eliminate them up front, but I'm too lazy) and I can squeeze out a couple of the branches in the inner loop(!) by assigning the new node index to near or far with conditional moves and guarding the stack push with a branch. Bears some testing.
29 November 2005:
18 November 2005:
However, before diving too deeply in there, I've been working on the single ray intersection code. After staring at it for a while and quizzing my officemate on how dot and cross products intermingle, I pulled out a bunch of redundant computation. While I was in there, it seemed like a good time to pack intersection data in the same order as it's stored in the kd-tree nodes instead of randomly accessing into the array of vertices. All told, it definitely helped, but I didn't keep precise numbers.
After that, I decided to bite the bullet and experiment with mailboxing. The results were somewhat confusing. With perfect mailboxing (one mailbox per primitive that stores the id of the last ray it saw. If a ray ever encounters a primitive it's already seen, it just skips that primitive) and 256x256 scenes I get the following:
That seems like a pretty compelling demonstration that mailboxing is useful. At the same time, the cbox numbers don't make a lot of sense. With mailboxing, despite the fact that the mailboxing accomplishes nothing, it makes raycasting more than 5% faster?! I have no explanation. There's a similarly positive effect on the other scenes, but it makes more sense there. I wonder if the compiler's generating different code that has a weird impact.
And no doubt some more variations I'm condensing down.
17 November 2005:
Also, I cleared up my confusion on why the packet tracer saw such a higher MaxIntersections count (number of leaves visited by the packet that visited the most leaves). First off, I was scaling it by the packet size. Secondly though, I only track it across the whole packet, not across the rays within the packet, so imperfections in coherence can inflate the MaxIntersections count for a packet even if no ray actually is active in that many leaves. Pretty obvious in hindsight.
Precision has reared its ugly head again. Take the example of ray 5102 in a 256x256 cbox scene. With the ray at a time code, this ray traverses: far, far, near, near + push, near + push, pop, hit geometry. The ray is only in the cell where it hits geometry from tMin of 1098.60 to 1098.89. As part of an SSE packet, the same ray again starts off far, far, near, near + push, near + push, pop. However, this time it shows up in the node where it should hit geometry with a tMin of 1098.83 and a tMax of 1098.62. Since tMin is larger than tMax, the SSE tracer culls this ray and concludes it misses all geometry. I'm surprised to see that the SSE traversal has that much trouble relative to standard C floats. I know there are a variety of SSE math styles available at a trade off of speed for precision, but I didn't select any of those. The only saving grace is that with a SIMD+SSE triangle intersector, I'll compute those intersections even for 'inactive' rays (because it'll be effectively free). That would catch this case, but it wouldn't help in general (if precision caused us to stop traversing one node sooner).
At this point, the stats look pretty competitive-- the packet tracer isn't doing that much extra work:
Stats for module kdtreeCPU (cbox 256x256 single ray at a time)
kdtreeCPU:Traversals 730616
kdtreeCPU:TraverseLeaves 178118
kdtreeCPU:Intersections 140214
kdtreeCPU:Push 141189
kdtreeCPU:Pop 112582
kdtreeCPU:MaxIntersections 9
kdtreeCPU:MailboxHits 20
kdtreeCPU:MaxStackDepth 5
kdtreeCPU:TriMissTHit 195
kdtreeCPU:TriMissBary 73994
kdtreeCPU:TriHit 66025
Stats for module kdtreeCPU (cbox 256x256 sse 4 rays at a time)
kdtreeCPU:Traversals 750128
kdtreeCPU:TraverseLeaves 185968
kdtreeCPU:Intersections 140333
kdtreeCPU:Push 151096
kdtreeCPU:Pop 121252
kdtreeCPU:MaxIntersections 11
kdtreeCPU:MailboxHits 19
kdtreeCPU:MaxStackDepth 5
kdtreeCPU:MissGeometry 1
kdtreeCPU:MissBadPacket 1024
kdtreeCPU:HitEmptyStack 3
kdtreeCPU:TriMissTHit 253
kdtreeCPU:TriMissBary 74056
kdtreeCPU:TriHit 66024
The 'MissGeometry' and 'HitEmptyStack' stats are poor ray 5102 and its
packet mates. The BadPacket count is the number of packets whose
direction vectors didn't agree in sign in at least one direction and
were traced ray at a time.The bad news is that the packet tracer isn't dramatically faster than the single ray tracer. In fact, the debug version is consistently slower (at least, for cbox), but the release version is slightly faster (again, for cbox). With the bunny, even the release packet tracer is slightly slower than the release single ray tracer. There is a ray of hope in all of this (pardon the pun). If I remove all the leaf intersection code and just test raw traversal speed, the release sse tracer is 2x the release single tracer for the cbox and a dinky 20% faster for the bunny.
15 November 2005:
if (hits[ii].tHit > tMaxVec[ii]) { allDone = false; }
if (hits[ii].tHit < tMaxVec[ii]) { allDone = false; }
That's right, I would declare a packet not done if any ray's tHit were
less than or greater than tMax. Clearly I meant tMin for the latter
condition. And, as I thought about it some more, I didn't even need
the latter condition. The good news is my performance is no longer
pathological for big scenes (the effect of the bogus check was that no
packet ever finished until it walked off the end of the whole kd-tree.
So, more geometry led to bigger trees led to huge performance gaps.Sadly, even that fix doesn't get me the legendary packet tracing speedup and I still see more intersections than the single ray at a time case. I worked through a simple example on the board and the wrinkle seems to come in determining when a packet is done. Here are two problematic scenarios: ray1 hits a primitive in the current cell. It's now done, but rays2-4 are still going. As long as the packet continues to traverse cells that ray1 would have traversed (if it hadn't hit anything) than my code will consider ray1 'live' and keep intersecting it with primitives, all of which will be rejected as worse hits than its current hit (after all, the current hit is the best possible hit, ray1 should be done).
Problem two: ray1 hits a primitive, but the packet keeps traversing because rays2-4 don't. Later, rays2-4 hit a primitive in a cell ray1 wouldn't have visited. That means, given the current algorithm, that ray1's current tMax is less than tMin which means tHit for hit1 is definitely less than tMax. It's hard to capture the proper tMax against which ray1's tHit should be tested since the algorithm goes out of its way to clobber tMax for inactive rays.
And now xenon has crashed while I was in the middle of mail trying to work out whether one of the stack values is truly redundant. Grr. I guess that means it's time to go home.
11 November 2005:
We're still doing vastly too many intersections (and, to some extent, traversals). I know my masking is very conservative right now, so hopefully tightening it up fix the problem.
10 November 2005:
Wow! That's certainly an interesting effect. Sort of like looking at the Cornell Box through a screen door whose vertical hatching is wider than its horizontal hatching.
Okay, it's now significantly better (there's a big difference between the SSE instructions that only affect the first component and the ones that affect all four). However, I have to fudge tMax by a lot more (10% or so) to avoid the precision problems with the cbox and it does a whole lot more traversals and intersections than the ray-at-a-time version. I strongly suspect I'm mismanaging the masks. I know I need to get the size of stack elements down. And, I'm suspicious there's some other math problem-- possibly just putting an epsilon fudge factor into the intersection routine since I still see glitching on robots and glassner. Still, progress.
8 November 2005:
Ever suspicious, I decided it was time to bulk up the brook writeQuery regression test. So, I augmented it to do the following: