GPU Notes

Sometimes, while tormenting innocent GPUs, I come across things so interesting, outrageous, unbelievable, or otherwise momentarily remarkable that I feel compelled to make records of them. This is where their stories get told.

27 August 2010:

The final i has been dotted and t crossed:
```
   Date: Fri, 27 Aug 2010 08:45:30 -0700
   From: [ Stanford Registrar ]
   Subject: Registrar Approval Mail

   Congratulations! Your dissertation/thesis id 0000000545 has been
   approved by the Office of the University Registrar.

   Please find the comments as: Congratulations!.

   The following are the details: 

   Name of the Student: Jeremy Sugerman
   Academic Career: GR
   Academic Program: CS
   Academic Plan: CS-PHD
   Milestone title: Programming Many-Core Systems with GRAMPS
```
Naturally, Pat's final comments were changes to the figures. Equally naturally, coercing LaTeX into complying took far longer than the simple nature of the requests. Here's the official pdf (slated to appear here once the University / Library digests it). As Jeff warned, the post-processing to insert the copyright and signature pages made the bookmarks and metadata a little wonky, so here's the final presubmission pdf, too.
Just shy of seven years, all told (September 2003 - August 2010). Now I have to figure out what I'm going to be when I grow up. Or at least find a job or something.

26 August 2010:

Almost there:

Despite my paranoia, and contingency plans, the Student Services Center had no qualms about accepting standard acid-free printer paper for the two required hard-copy pages. Which is good, since with electronic submission for the thesis itself, I certainly hope they just toss the title page, and scan and toss the signature page. Now I just need to make the final edits and secure Pat's final sign-off.

14 May 2010:

Success. Even if the talk did still go a little long. Now I have to track down my parents, who are lost together somewhere on campus.

7 May 2010:

Now all I need to do is actually finish preparing:

   University Oral Examination: Friday, May 14, 2:15pm, Gates 104
   Programming Many-Core Systems with GRAMPS

   Jeremy Sugerman
   Advised by Pat Hanrahan

   Department of Computer Science
   Stanford University

   2:15 PM, Friday, May 14, 2010
   (Refreshments at 2:00 PM)
   Gates building, room 104

   Abstract:

      The era of increased performance from faster single cores and
   optimized single core programs is ending.  Instead, a major aspect of
   the improvements to new processors is more parallelism: multiple threads
   per core and multiple cores per processor.  In addition, highly parallel
   GPU cores, initially developed for shading, are increasingly being
   adopted to offload and augment conventional CPUs.  Processor vendors are
   already discussing chips that combine CPU and GPU cores.  These trends
   are leading to heterogeneous, commodity, many-core platforms that have
   excellent *potential* performance, but also (not-so-excellent)
   significant *actual* complexity.  In both research and industry run-time
   systems, domain-specific languages, and more generally, parallel
   programming models, are becoming the tools to contain this complexity
   and deliver this performance.

      In this talk I will present GRAMPS, a programming model for these
   heterogeneous, commodity, many-core systems that expresses programs as
   graphs of thread- and data-parallel stages communicating via queues.  I
   will show how we built applications from domains including interactive
   graphics, MapReduce, physical simulation, and image processing in GRAMPS
   and describe GRAMPS implementations for three different architectures:
   two simulated future rendering platforms and one current multicore x86
   machine.  In these cases, GRAMPS efficiently finds and exploits the
   available parallelism while containing the footprint / buffering
   required for the queues.  Finally, I will discuss how GRAMPS's
   scheduling compares to three representatives of popular programming
   models: task-stealing, breadth-first, and static scheduling.  GRAMPS's
   dynamic scheduling provides good load-balance with low overhead and its
   knowledge from the application graph gives it multiple advantages in
   managing footprint.

3 May 2010:

Posters are by far my least preferred academic deliverable. They're time consuming to put together-- at least 50% of the time to put together slides for a presentation-- and then you need to deal with printing them, transporting them, etc. On the consumption end, they're very low leverage: the poster format is sufficiently low bandwidth that without a full-time presenter explaining, they have minimal meaning and there's at most room for a couple of listeners at a time. Basically, a poster plus presenter boils down to a watered-down talk with more limited visual aids available to a tiny audience (over and over for the duration of a poster session). The only upside is that it is potentially more interactive than a talk in a more formal setting or to a self-conscious audience.
Anyhow, despite this, I grudgingly put together a poster for the Computer Forum poster session last week (ppt) (pdf). I've ducked it for the past few years because it conveniently always falls on Wednesday, but now that I no longer go to VMware, it seemed like I should present rather than look for another excuse.

16 April 2010:

What a low productivity week. Really, the only directly useful thing I did was schedule my defense. That's indisputably critical forward progress, but it's not much to have to show. Other than that, the week has been a blur of multiple belated birthday pizza trips, a slightly awkward inclusion in the periodic CS faculty and staff celebration (that did include an interesting and useful conversation about thesis defenses and past anecdotal highlights from Mendel and Don Knuth), a barely attended PPL meeting, my officemate's wild-and-crazy GCafe on providing audio feedback in image quality in digital cameras, and a handful of publication review requests (who knew there was an IEEE Computer Architecture Letters?).

6 March 2010:

Sometimes just because a thing can be done does not necessarily mean that it should be done. Thus it appears to be with Python list comprehensions. It occurred to me that one win-win way to avoid my distaste for hanging indentation at the end of nested loops and to feel I am writing idiomatically, not just correctly, is to use list comprehensions more broadly. The good news is that it worked. On both counts. The bad news is that now I have a new problem. I have no good ideas how to wrap this reasonably to 80 column lines:
```
   # Note: Sometimes a thing must be done, even if of dubious merit,
   # just to prove it is possible.  Thus it is with this (awesome) list
   # comprehension.  And people like to complain about perl programs...

   data = dict([(k, [ (r['testName'], r['cyclesPerCore'] / 100000) for r in allBySched[k] if r['numCores'] == maxCores ]) for k in allBySched ])
```
That's actual production code for pulling out the appropriate subset from the complete test results (a dictionary whose entries are lists of inner-dictionaries; each inner-dictionary contains all the parsed stats from one run) to plot total execution time for each test as a function of scheduler mode.
Earlier, I caught myself typing initial experiments (into the interpreter interactively) to add layers of reduce() and some choice lambdas to "streamline" other pieces even more. Luckily, I stopped before the fevered muttering got too strong. As entertaining as it is to think in functional language constructs, the resulting code rapidly attains write-only status. I mean, I'm uncertain whether I'll still understand the snippet above when I wake up in the morning.

4 March 2010:

Note to self: 'g++ -x c++ -E -dM - < /dev/null' is a useful way to dump all of the stray #defines g++ is applying to my files.

28 February 2010:

This morning I couldn't even spell Python and now I are one. Err, or something. Which is really to say that after years of read-only interaction with Python, I've taken my first foray into writing it as I ported my perl module for parsing GRAMPS stats. Apparently perl is arcane (and slightly malodorous) knowledge for kids these days (i.e., my fellow PhD students and undergrads). In the spirit of generosity I have acquired with my advanced years, I am indulging them. Now if I just start writing PHP too, I can totally mingle with the 20-something cool kids. Or I could pick up Scala to mingle with the mid-to-late 20s slightly-European-air-of-sophistication set (shameful confession: while I have no real interest in PHP, Martin Odersky's PPL talk did inspire me to consider where I could find some excuses to write some Scala code as part of doing useful work. Nothing leapt to mind, though).
Anyhow, I have to say that the Python compiler / interpreter is a lot grouchier than perl. I think it stems from two (correlated) factors: a designer with an academic interest in programming languages and first-class exceptions as an intrinsic language feature from the start. Perl would attempt to do what I want in the name of getting the job done. Python wants to teach me a little something about programming and happens to have this nice exception mechanism sitting around with which to thwack me if I attempt expedience (or sloppiness, depending on the eye of the beholder). On the other hand, the syntax, especially for nested types (e.g., dictionaries-of-dictionaries, arrays-of-dicts) is nicer. I do miss the closing curly braces, though. It looks weird to leave occasionally deeply indented lines hanging there terminating a function or a class. I can make things look nicer to my eye with a terminal 'return' to close functions, but that's probably not idiomatic python.
And the language nearly lost me, no matter how hip it is, when I went to use the C ternary operator (which tends to happen within the first 20 lines of code I write in any language), failed, and read through the history of rationalizations that ultimately led to Python's alternative ternary syntax. I would probably have walked away with no regrets if it were still the pre-2.5 days of ternary hand-wringing and quibbling.

26 February 2010:

I replaced the fancy ATI card in my machine with a cheapo NVIDIA card cannibalized from an idle machine and, with no crashes or reboots, had X and accelerated OpenGL working. A few more minutes' tinkering persuaded it to deal with my multimon setup correctly. It's really quite remarkable: text is actually anti-aliased, double-buffered visuals don't fail to redraw or corrupt their contents, the mouse cursor doesn't get garbled at the drop of a hat, and grampsviz's line drawing is instantaneous instead of taking tens of seconds. It's really a shame how much trouble I consistently have with ATI drivers (regardless of OS or proprietary versus open source)-- I think they've had clearly preferable hardware for about 50% of the GPU research projects I've done since starting grad school, but using those GPUs has been such a trial that I've developed quite the aversion.
Of course, the universe is not without some conservation of grief. It appears that the default experience remotely displaying GLX applications back to an X server running the nvidia binary drivers is for the client GLX code to fail to find any usable visuals and blow up ("couldn't find RGB GLX visual or fbconfig"). The entertaining part, though, is that all that's really happening is that at least one member of the nvidia / X.org / mesa trio is lying about itself in a way that prevents software (indirect) rendering from kicking in. When I explicitly set LIBGL_ALWAYS_INDIRECT=1 then everything works just like it used to. The Internet indicates many users have bounced off this problem with varying degrees of recognition and success. I haven't found any maintainers (distribution or among the trio) who acknowledge the problem even enough to point blame at someone else, let alone adjust their code so that the right thing happens. Ah, well.

24 February 2010:

C++ really is big enough that one never stops learning it. Today, completely by accident, I learned that C++98 defines long-form names of boolean and bitwise operators. I learned this when my brain switched from editing perl files to editing C++ files and my fingers were slower making the transition.
After some testing, I glanced back at my code and discovered g++ had been happily compiling:
```
   if (maxRecord != 0 and maxRecord < entries.size()) {
      entries.resize(maxRecord);
   }
```
without any complaint. Not only that, my editor had been happily highlighting 'and' as a reserved word. After some puzzled exclamations and skepticism, I satisfied myself that g++ would accept 'and' in place of && and gcc would not. After still more digging (producing a search query involving the word 'and' that produces useful results is nontrivial), I learned the result above: this is all by design. Maybe the C++ language designers want closer source compatibility to scripting languages. Or more ways to torture the heroic non-native English speakers who endure already highly anglocentric programming languages (what about internationalization? Clearly my compiler should look for different && synonyms depending on my locale...).
Maybe a future language revision will accept "x equals one plus one." (I verified that 'plus' is not (yet) a keyword). And then we'll have to be careful about mocking COBOL programs. Or SQL queries. At least APL will still be thoroughly mockable.
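For reference, here is a minimal sketch of the alternative tokens at work. The function names and values are made up for illustration; the tokens themselves (and, or, not, bitand, bitor, xor, compl) are standard C++ keywords. C only provides them as macros through <iso646.h>, which would be consistent with gcc rejecting the bare word.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the resize check above, spelled with 'and'
// instead of '&&'.
bool ShouldTruncate(std::size_t maxRecord, std::size_t numEntries)
{
   return maxRecord != 0 and maxRecord < numEntries;
}

// Bitwise spellings: 'bitand' is &, 'bitor' is |, 'compl' is ~.
unsigned ClearThenSet(unsigned value, unsigned clearMask, unsigned setMask)
{
   return (value bitand compl clearMask) bitor setMask;
}

// Logical spellings: 'not' is !, 'or' is ||.
bool IsEmptyOrUnlimited(std::size_t maxRecord, std::size_t numEntries)
{
   return not numEntries or maxRecord == 0;
}
```

Each function compiles to exactly what the symbolic spelling would produce; the words are pure synonyms, not new operators.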

23 February 2010:

On an unrelated note, some day, on some OS, I will have an experience attempting to install some version of some variant of an ATI graphics driver that does not leave me despairing and frustrated. Sadly, that day is not today. Still, after about an hour, many reboots, and cycling through two X.org and one proprietary driver, at least I clawed my way back to my previous stable, but suboptimal state. If only blended lines weren't pathologically slow and I had anti-aliased text. If only.
You'd think that people defining 64-bit Linux / glibc would try a little harder for compatibility (okay, in actuality I think nothing of the sort given my prior knowledge of the attitudes and track records involved. It still bugs me). Anyhow, ignoring all big items, consider as exhibit A calling printf() on a 64-bit integer. On 32-bit systems, a 64-bit integer is a long long and requires %lld. On 64-bit systems, a 64-bit integer is a long and requires %ld. Moreover, just to be helpful (irritating) gcc warns about format string mismatches. This leaves me with no good way to write a single version of code to printf() 64-bit numbers that compiles cleanly on 64-bit and 32-bit versions of Linux. For bonus points, it seems as if gcc's printf warning, which is generally quite helpful, is without merit in the 64-bit case. In my testing, using %lld to printf() a long on a 64-bit OS works fine (without corrupting any prior or subsequent arguments). Sigh.
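One possible way around the %ld versus %lld split, sketched under the assumption that the C99 format macros are available: <inttypes.h> (<cinttypes> in C++) defines PRId64, which expands to the right length modifier for int64_t on each platform. The helper names below are hypothetical. Older glibc / g++ combinations may only expose these macros to C++ code when __STDC_FORMAT_MACROS is defined before the include, so treat this as a sketch rather than a guaranteed fix for every toolchain.

```cpp
#include <cinttypes>   // PRId64
#include <cstdint>     // std::int64_t
#include <cstdio>      // std::snprintf
#include <string>

// Hypothetical helper: format a 64-bit integer without caring whether
// int64_t is 'long' or 'long long' on this platform.
std::string FormatInt64(std::int64_t value)
{
   char buf[32];   // 19 digits, a sign, and the terminator fit easily
   std::snprintf(buf, sizeof buf, "%" PRId64, value);
   return std::string(buf);
}

// Alternative that avoids the macro: widen explicitly and always use %lld.
std::string FormatViaLongLong(std::int64_t value)
{
   char buf[32];
   std::snprintf(buf, sizeof buf, "%lld", static_cast<long long>(value));
   return std::string(buf);
}
```

Either form should keep gcc's format checking quiet on both 32-bit and 64-bit builds, since the argument type always matches the conversion.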

27 January 2010:

Ubuntu just works often enough that when it doesn't, it fails in ways that severely annoy me. For a while now, I've been wrestling with the conundrum that while 'man pthreads' and even 'man pthread_setaffinity_np' produced working manual pages, I had nothing for simple functions like pthread_join, pthread_self, etc. I fed apt-cache search every combination of pthread related terms I could devise and couldn't find any packages with those man pages. Finally today, the irritation crossed some invisible threshold that prompted me to poke around the Internet until I found a solution.
I now know where Ubuntu chooses to package the man pages for all the core pthread_* functions, but NOT pthreads(7) or a number of support functions: manpages-posix-dev. Of course. To aid discoverability, the package has this (uselessly) pithy description:
```
   Description: Manual pages about using a POSIX system for development
   These man pages describe the POSIX programming interface, including
   these two sections:
       7 = POSIX header files (with 7posix extension)
       3 = POSIX library calls (with 3posix extension)
```
Grr. I suspect Ubuntu inherited this particular partitioning of content from Debian where it was employed for the sort of pedantically correct and unreasonably stupid reasons of which that project is so fond.
And yes, after dodging it for ages, I find myself working on a pthreads-derived GRAMPS runtime. I blame Christos.

4 January 2010:

Well, that was a brisk few months. I guess I've been channeling my writing energies elsewhere for the past quarter (largely into writing code: GRAMPS and various standalone toys). At the end of last quarter, I used my GCafe slot as an opportunity to collect up the map-reduce and physics stuff and their derivative GRAMPS influences into a cohesive end-to-end narrative: (ppt) (pdf). The GCafe reception was excellent. Pat was effusive, as always (often, at least), but I also got good questions and observations from multiple people.
I also put together an abbreviated (read: harder to follow) version that I presented at the PPL retreat the next day: (ppt) (pdf). That reception was much more reserved. It's partially setting-- a large room full of people more in listening (plus checking e-mail on their laptops) mode-- and partially cramming the whole thing into 20-ish minutes.
Mike asked how GRAMPS compares to TBB, Grand Central Dispatch, and the general emerging workqueue based toolkits. I had my suspicions and superficial answer at the time. Since then, I've read through a lot of the documentation for both and realized how thoroughly different they are (from GRAMPS-- they are very similar to each other). So much so that I need to think about how best to relate them at all (since that is a very valid-seeming question). Also, good grief do the task scheduler operations in TBB require excruciating micromanagement boiler-plate! It is very evident that their intent is to be the invisible underpinning of the TBB parallel data structures and a last-resort escape hatch rather than a primary API for users.

28 August 2009:

That was a cute bug. For various reasons, the GRAMPS deadlock detection code is imperfect (it's somewhat optimistic). I finally decided to use the simulator to do much more robust deadlock detection. On each clock, each execution core (and fixed function unit) now returns whether or not it made forward progress and the simulator panics if forward progress stops. Except that testing showed the sanity tests blowing up in two completely different ways!
The first reflected a flaw in the simulator / a flaw in my assumption that execution units are the only measure of forward progress. For some applications, the last work can be consumed in the middle of the pipeline (even without loops). For example, any time the last triangle in a scene is backfacing or completely occluded, the rasterization pipeline does its last work at RAST: it eats the triangle and emits no fragments. At this point, RAST completes and GRAMPS sends NOMORE along all its outputs. If the downstream stages were threads, then they would explicitly consume the NOMORE and finish. When the downstream stages are shaders, however, GRAMPS itself consumes the NOMORE and propagates it downstream. Since the run-time is built into the simulator, no simulated instructions are issued and the simulator shows no forward progress for as many cycles as there are shader stages between the last stage to do work and the bottom of the pipeline. I could fix this by enhancing the scheduler's handling of NOMORE for shader stages, but that seems like a poor optimization / complexity tradeoff for a tiny penalty (single cycle) for a rare occurrence (once per stage over the lifetime of the application), so I just added 10 cycles worth of tolerance into the deadlock detector.
The second flaw was less expected (I had some notion from things I have seen in logs before that termination sometimes took a couple of cycles to be recognized)-- one of the sanity tests had a run of more than 98 thousand cycles without making forward progress, after which it noticed it was done and exited cleanly. It turned out that the micro-core scheduler never made any calls to GrCheckForDoneOrDeadlock(). Thus, GPU-like GRAMPS did not notice termination for execution graphs with cycles until the tier-1 scheduling period elapsed for the fat core (every million or so cycles). Oops.
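The forward-progress check with a tolerance window described above might be sketched roughly as follows. This is not the actual simulator code: ProgressWatchdog, Tick(), and the single per-clock boolean are hypothetical names, and the only number taken from the entry is the 10 cycles of slack.

```cpp
#include <cstdint>

// Hypothetical watchdog: declares a stall only after more than
// 'tolerance' consecutive clocks with no forward progress.
class ProgressWatchdog {
public:
   explicit ProgressWatchdog(std::uint32_t toleranceCycles)
      : tolerance(toleranceCycles), idleCycles(0) {}

   // Call once per simulated clock with the OR of every unit's
   // "made forward progress" result. Returns true when the simulator
   // should panic.
   bool Tick(bool anyUnitMadeProgress)
   {
      if (anyUnitMadeProgress) {
         idleCycles = 0;      // any progress resets the window
         return false;
      }
      ++idleCycles;
      return idleCycles > tolerance;
   }

private:
   std::uint32_t tolerance;
   std::uint32_t idleCycles;
};
```

With a tolerance of 10, a short burst of silent NOMORE propagation through shader stages passes unnoticed, while a 98 thousand cycle stall like the second bug trips the check almost immediately.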

26 August 2009:

Lots of checkins today. The brunt of my writing energy went into checkin messages (and the code itself!), so here's some key bits from the most recent checkin:
With these changes, I am able to use in-place combine shaders to implement a combine stage in the map-reduce histogram. Actually, I was pleasantly surprised that my scheme for propagating the final results fit seamlessly into the same interface the reduce stage already expected, so inserting combine was completely transparent.
... One disappointing thing is that the run-time currently appears to cope very poorly with histogram-combine. That is, it's a marginal footprint savings over reduce and it's way slower. I'll look into it more when I'm back on campus, but one nuisance problem is that GrTier1CanPreReserve() always iterates through subqueues starting from zero. When there's a really deep subQueue with a high index (as happens to the histogram bucket for brightest pixels) then it gets starved until all the lower numbered buckets are completely empty. This makes it (a) really deep and (b) can lead to deadlock if it actually fills and enough map instances block trying to push more. Bottom line: a good reminder that push coalescing is really poor with queue sets and that always scanning a list from the front has fairness problems.
Bottom line: I can now slide a combine stage transparently into a map-reduce app (when the reduction operation supports it) and GRAMPS can do the appropriate parallel partial reduction. But, there are some dubious suboptimal paths that currently get tickled in the run-time along the way.
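One common remedy for the front-of-list starvation described above is to resume each scan where the previous one stopped. The sketch below assumes subqueue depths are visible as a plain vector; PickNextSubqueue() and its cursor parameter are hypothetical and are not the GrTier1CanPreReserve() interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical round-robin scan. Starts at 'cursor', wraps around once,
// and returns the index of the first non-empty subqueue (or -1 if all
// are empty). On success the cursor moves just past the chosen index so
// the next call begins with a different subqueue.
int PickNextSubqueue(const std::vector<std::size_t>& depths,
                     std::size_t& cursor)
{
   const std::size_t n = depths.size();
   for (std::size_t step = 0; step < n; ++step) {
      const std::size_t idx = (cursor + step) % n;
      if (depths[idx] > 0) {
         cursor = (idx + 1) % n;
         return static_cast<int>(idx);
      }
   }
   return -1;
}
```

With depths of 3, 0, and 5, successive calls return 0, 2, 0, and so on, so the deep high-index bucket gets serviced every other pick instead of waiting for every lower bucket to drain. Whether this is the right fairness policy for queue sets is a separate question.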

25 August 2009:

Hooray! GRAMPS can now run parallel partial reductions using shaders. There's still some definite tidying up before I can check it in and then there'll be a round of refinement back applying it to queue sets and such, but nevertheless, hooray. And man, there are a number of interesting (read: sticky) details about both reduce/combine style operations and operating in-place on a queue (i.e., binding the same queue as input and output and consuming it in place).

21 August 2009:

I realized I never put up links to the GRAMPS SIGGRAPH material. Oops. Here's the fast forward slides and here's the slides from the talk itself: (ppt) (pdf). The talk itself seemed well received (and very different from the other talks in its session). I got some good questions, but I was surprised no one asked me the good question I was expecting: How do GRAMPS and OpenCL overlap? Conceptually OpenCL has something it calls a queue to which one submits work (which is nothing like a GRAMPS queue, actually) and OpenCL has both "tasks" and "kernels". And, there was lots of OpenCL hype / discussion going on. I was strongly expecting some question privately, even if not during the session, but it never happened. It is actually an interesting question-- at the superficial level they're very different and suffer from the unfortunate overloading of many key streaming and parallel terms. At the deeper level, though, I think they have very compatible views of the mature future.
OpenCL right now is, not necessarily intentionally but nevertheless indisputably, extremely short-term / low horizon. It's at its best standardizing the status quo (or at least the status quo of what I suspect the GPU vendors have designed, even though neither Larrabee nor whatever AMD/ATI's OpenCL capable GPU will be called exists publicly). It's very weak-- somewhere between flawed, naive, and just weird-- when trying to be more future looking and future proof. I think that's mostly fine, but suspect some of the wackier things in it will become the OpenCL equivalent of OpenGL's built-in spline evaluators: cute legacy niches that every implementation is stuck supporting, but are never used in practice. I have a fairly extensive set of impressions and future guidance for OpenCL after digging into the conceptual / specification side of things at SIGGRAPH and hacking on some simple apps last week. Maybe I'll take some time to write them up and post them.

14 August 2009:

Alright, I have produced a simple OpenCL test program. I harvested and rewrote one of the AMD OpenCL sample applications to be standalone (my specific motivation was to disentangle their sticky C++ harness and what looked a lot like prebuilt / binary-only utility functions). I run it with this trivial kernel that just writes the thread ID to the corresponding place in the output buffer.
As before, a few minor surprises showed up in testing. There's a significant startup cost associated with AMD's OpenCL implementation. On my machine, the app takes at least 80 milliseconds even with a tiny work size. Interestingly, even as I scale the work size up to the point where it takes 80 milliseconds just to run the kernel (as reported by clGetEventProfilingInfo()), the entire app itself only takes 100 milliseconds. Years ago, when we were writing gpubench and testing Brook, we saw that behavior launching kernels on a GPU and explained it as a fixed host side driver cost and variable asynchronous GPU cost. That makes somewhat less sense when everything's running on the host / CPUs and thus cycles are zero sum.
The other surprise was that valgrind gets mildly unhappy at the AMD OpenCL library when I run my simple test. It's intermittent and never shows up with extremely small work sizes (1 - 16 entries), but regularly appears at 256 instances or higher. Valgrind complains of a number of invalid 4-byte reads and writes deep in the guts of the library (e.g. a bunch of different offsets inside cpu::WorkGroupOperation::execute()). I wish I understood what it was doing well enough to tell whether this is somehow my fault.
I'm also curious whether it's attempting to enforce my WRITE_ONLY buffer or if I could actually read from it without problems. And what's going on with alignment. The demos explicitly align their allocations, but I do not. I wonder if it's a performance win, total superstition, or a correctness requirement I'm violating without the library enforcing.
Today's OpenCL question: what is the actual contract associated with CL_MEM_USE_HOST_PTR? I can't tell if it's useful or not. On the surface, it appears quite useful: I hand a host pointer to OpenCL and kernels, etc. operate on it "in place". However, the devil is in the details of "in place". The specification explicitly says:
```
   OpenCL implementations are allowed to cache the
   buffer contents pointed to by host_ptr in device
   memory. This cached copy can be used when kernels are
   executed on a device.
```
This makes sense. Without this escape, a GPU or accelerator would be stuck using PCI-e or other shared memory to access the buffer object and it would be too slow to be useful. Where I wish the spec would chime in is when the implementation is required to flush the cached copy back to the host. The entire point of CL_MEM_USE_HOST_PTR, from my perspective, is a syntactic simplification to save boiler-plate calls to copy my data in and out. I would like that to mean that as long as I make sure to wait for all kernels accessing the buffer to finish, my host code can directly consume its pointer to the data. I.e., any device side cache is filled on launch and flushed on completion so that host code is free to do arbitrary things outside of that time. If correctness requires explicit additional synchronization or calls to clEnqueueMapBuffer then CL_MEM_USE_HOST_PTR is useless as a syntactic aid and seems effectively useless overall (I can't construct any compelling performance optimizations it enables under those conditions). I just wish the specification would rule one way or the other. It's such a natural question and surprising omission.
(Note: one could construct an argument that it has an additional advantage of conserving host address space, but (a) even huge device memories are pretty small in the grand scheme of things and (b) in the age of 64-bit hosts, address space is even more plentiful).

13 August 2009:

I decided to take a step back and produce a simple diagnostic program. Thus, I offer up clInfo. It enumerates all the OpenCL platform and device information and dumps it. Here's what the AMD CPU OpenCL run-time produces on this machine:

   For test only: Expires on Wed Sep 30 00:00:00 2009
   Found 1 platform(s).
   platform[(nil)]: profile: FULL_PROFILE
   platform[(nil)]: version: OpenCL 1.0 ATI-Stream-v2.0-beta2
   platform[(nil)]: name: ATI Stream
   platform[(nil)]: vendor: Advanced Micro Devices
   platform[(nil)]: extensions: 
   platform[(nil)]: Found 1 device(s).
        device[0x9ac2d56]: NAME: Intel(R) Pentium(R) D CPU 3.00GHz
        device[0x9ac2d56]: VENDOR: GenuineIntel
        device[0x9ac2d56]: PROFILE: FULL_PROFILE
        device[0x9ac2d56]: VERSION: OpenCL 1.0
        device[0x9ac2d56]: EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
        device[0x9ac2d56]: DRIVER_VERSION: 1.0

        device[0x9ac2d56]: Type: CPU 
        device[0x9ac2d56]: EXECUTION_CAPABILITIES: Kernel 
        device[0x9ac2d56]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
        device[0x9ac2d56]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
        device[0x9ac2d56]: SINGLE_FP_CONFIG: 0x7
        device[0x9ac2d56]: QUEUE_PROPERTIES: 0x2

        device[0x9ac2d56]: VENDOR_ID: 4098
        device[0x9ac2d56]: MAX_COMPUTE_UNITS: 2
        device[0x9ac2d56]: MAX_WORK_ITEM_DIMENSIONS: 3
        device[0x9ac2d56]: MAX_WORK_GROUP_SIZE: 1024
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_CHAR: 16
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_SHORT: 8
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_INT: 4
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_LONG: 2
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
        device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
        device[0x9ac2d56]: MAX_CLOCK_FREQUENCY: 2992
        device[0x9ac2d56]: ADDRESS_BITS: 32
        device[0x9ac2d56]: MAX_MEM_ALLOC_SIZE: 1073741824
        device[0x9ac2d56]: IMAGE_SUPPORT: 0
        device[0x9ac2d56]: MAX_READ_IMAGE_ARGS: 0
        device[0x9ac2d56]: MAX_WRITE_IMAGE_ARGS: 0
        device[0x9ac2d56]: IMAGE2D_MAX_WIDTH: 0
        device[0x9ac2d56]: IMAGE2D_MAX_HEIGHT: 0
        device[0x9ac2d56]: IMAGE3D_MAX_WIDTH: 0
        device[0x9ac2d56]: IMAGE3D_MAX_HEIGHT: 0
        device[0x9ac2d56]: IMAGE3D_MAX_DEPTH: 0
        device[0x9ac2d56]: MAX_SAMPLERS: 0
        device[0x9ac2d56]: MAX_PARAMETER_SIZE: 4096
        device[0x9ac2d56]: MEM_BASE_ADDR_ALIGN: 1024
        device[0x9ac2d56]: MIN_DATA_TYPE_ALIGN_SIZE: 128
        device[0x9ac2d56]: GLOBAL_MEM_CACHELINE_SIZE: 64
        device[0x9ac2d56]: GLOBAL_MEM_CACHE_SIZE: 65536
        device[0x9ac2d56]: GLOBAL_MEM_SIZE: 1073741824
        device[0x9ac2d56]: MAX_CONSTANT_BUFFER_SIZE: 65536
        device[0x9ac2d56]: MAX_CONSTANT_ARGS: 8
        device[0x9ac2d56]: LOCAL_MEM_SIZE: 32768
        device[0x9ac2d56]: ERROR_CORRECTION_SUPPORT: 0
        device[0x9ac2d56]: PROFILING_TIMER_RESOLUTION: 1
        device[0x9ac2d56]: ENDIAN_LITTLE: 1
        device[0x9ac2d56]: AVAILABLE: 1
        device[0x9ac2d56]: COMPILER_AVAILABLE: 1

A few surprises popped up. As is immediately visible, AMD doesn't bother returning a valid platform ID. Instead they rely upon the "implementation-defined" semantics for NULL. Fair enough, it's their implementation. More surprisingly, a number of device parameters turned out to be 64 bit quantities. Rather than sort it out, I just always use a long long for numeric properties (it's not as if peak integer performance is critical, so simpler code wins). Oh, and valgrind informs me that AMD's OpenCL libraries leak memory (though thankfully my super simple program itself does not).

I wonder if I can find someone who has the NVIDIA OpenCL drivers to compare results. Or I could dig up an NVIDIA GPU, thereby completing the irony (running AMD's OpenCL run-time on a machine with Intel CPUs and an NVIDIA GPU) and also freeing me from my BrokenGL versus broken ATI driver installation grief.

How much credit should I give OpenCL implementors? If reasonable people wrote the run-times, then clCreateKernel() should increase the ref count on the program object (or else cache everything it needs locally). That is, modulo error checking, the following sequence should work fine:
```
     /*
      * Create a program long enough to bind the specified kernel and then
      * decrement its ref count immediately.  This should work fine unless
      * the kernel object both depends on the program object and
      * clCreateKernel() fails to call clRetainProgram() internally.
      */

     src = FileToString(inputFile);
     srcLen = strlen(src);
     /* The lengths argument is an array of size_t, hence &srcLen. */
     program = clCreateProgramWithSource(ctx, 1, &src, &srcLen, &status);
     clBuildProgram(program, 1, device_list, NULL, NULL, NULL);
     kernel = clCreateKernel(program, kernelName, &status);
     clReleaseProgram(program);

     return kernel;
```
The specification is silent on this point. It'd be pointlessly burdensome to require the application to keep around a handle to a program object it's never going to use just to release it later (or to use clGetKernelInfo() to fish it out at cleanup time). Of course, this is the same specification whose object-creation calls return success / failure as an out-param rather than a return value so burdening the programmer is clearly not a significant concern.
Hmm. Actually, it occurs to me to wonder about this question more generally. For example, can I clReleaseContext() after the last time I explicitly refer to my context and rely upon implicit reference counts? Once I have a command queue and my kernels, my program is done with explicit references to the context. I'd like to release it then and forget about it, but obviously I don't want the run-time to free it until all the dependent objects are dead.
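The behaviour being hoped for is ordinary parent-retaining reference counting. Here is a toy model in plain C with made-up types: nothing below is an OpenCL call, and whether any real run-time behaves this way is exactly the open question. The freedFlag field exists only so the moment of destruction can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for cl_program / cl_kernel; all names are hypothetical. */
typedef struct Program { int refCount; int *freedFlag; } Program;
typedef struct Kernel  { int refCount; Program *program; } Kernel;

Program *
ProgramCreate(int *freedFlag)
{
   Program *p = malloc(sizeof *p);
   assert(p != NULL);
   p->refCount = 1;          /* the caller's reference */
   p->freedFlag = freedFlag; /* set to 1 when the object is really freed */
   *freedFlag = 0;
   return p;
}

void
ProgramRelease(Program *p)
{
   if (--p->refCount == 0) {
      *p->freedFlag = 1;
      free(p);
   }
}

Kernel *
KernelCreate(Program *p)
{
   Kernel *k = malloc(sizeof *k);
   assert(k != NULL);
   k->refCount = 1;
   k->program = p;
   p->refCount++;            /* the child retains its parent */
   return k;
}

void
KernelRelease(Kernel *k)
{
   if (--k->refCount == 0) {
      ProgramRelease(k->program); /* drop the implicit reference last */
      free(k);
   }
}
```

In this model the caller can release the program right after creating the kernel; the program only goes away when the last kernel does. The same pattern would cover a context that has to outlive its command queues and kernels.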
Time for an embarrassing confession. I just spent the past many hours producing a cleanly formatted text version of the OpenCL specification. I find text documents immensely preferable for searching, browsing, having open in a terminal window while I work, etc. The OpenGL extension specifications were great that way. Sadly, the core OpenCL spec is all pdf. And, OpenCL is young enough that I can't find manual pages (in HTML, info, man, text, or any form). So, I asked pdftotext to attack the pdf and then cleaned up the results (with extensive automated assistance from my editor). As proof of my questionable judgement, I offer opencl-1.0.43.txt. Ah, well. On the plus side, I also essentially skimmed the entire spec while I was at it, so I've continued to enhance my limited OpenCL expertise.
Ha. Ha ha ha. Appendix D.1 of the OpenCL spec is entitled "A Simple OpenCL Kernel". It has a two line kernel that computes a dot product. It then has 174 lines of host code to run said kernel. Plus a small helper function. Then in D.3 they offer a "simple" reduction that just adds up all the elements in the input. Ignoring how complicated the kernel is (take note, anyone to whom it's not obvious that reduce should be a first class kernel-type alongside map), the host code to run it is over 400 lines long. To add up a list of numbers. Oy, vey, indeed.
The ironic thing, or perhaps just the disappointing thing, is that this is a specification predicated on the existence of nontrivial compiler technology, which is leveraged all over the place for the kernels themselves. Would it have been so hard to do something similar for setup? (I say this with full fear of the ludicrous templated C++ bindings that are already making their way from the abyss to this more innocent world).
In other news, there are a couple of clear places the OpenCL authors were feeling a bit punchy and let loose in the footnotes. Two fine examples, back to back (emphasis added in both cases):
28. The user is cautioned that for some usages, e.g. mad(a, b, -a*b), the definition of mad() is loose enough that almost any result is allowed from mad() for some values of a and b.
29. Frequently vector operations need n + 1 bits temporarily to calculate a result. The rhadd instruction gives you an extra bit without needing to upsample and downsample. This can be a profound performance win.
Important safety tips, thanks Egon (possibly even profoundly important, if you know what I mean).
The OpenCL specification defines SIMD and SPMD, but doesn't mention SIMT. I wonder how much or little NVIDIA fought to include it.

11 August 2009:

In today's edition of reasons I love AMD, downloading the OpenCL drivers requires registering a username and password with AMD Developer Central. Seriously?! I'd be annoyed, but live with it if they just wanted a valid e-mail address, but what am I going to do with another useless username and password for some site I visit once in a blue moon (rhetorical question: I'm going to forget it). I wonder how long it'll be until someone intentionally or carelessly makes the files available somewhere else online.
Bonus hilarity: The top of the Developer Central Registration form says, "Membership in AMD Developer Central is free, but the benefits are priceless." The word 'benefits' is a hyperlink and, curious to learn about the awesomeness that awaited, I clicked on it. The result: a page saying "Error - Page Not Found" and search links to all the pages that included the word benefits. On the other hand, if it turns out this broken link is intentional commentary from some atypically jaded AMD employee on just how priceless the benefits of registration are, I want to give him or her a gold star.
Update: Sigh. It turns out I can predict the future. I already have an AMD Developer Central account. Whose username and password I've already forgotten. Hooray for the password reset / recovery form. Ridiculous. Well, time to download all the packages and cache them locally before I forget and have to repeat this rigamarole.
Oy, vey, is OpenCL verbose. Good grief. It takes pages and pages of code in order to set up, run, and clean up a single basic kernel. This is absurd. Actually, it's entirely to be expected, but is why I hate standards-based APIs.

10 August 2009:

I don't understand how ATI (or the ATI portion of AMD these days) still has any customers. Their drivers are the biggest disaster I've encountered for any piece of hardware from any IHV. I learned more than ten years ago that buying an ATI card was an invitation to disaster-- an unending cascade of buggy drivers that inflicted needless collateral damage even when they limped along enough to get pixels on the screen. About six years ago I started grad school and started playing with programmable GPUs. That meant I had to deal with ATI again (at least, if I wanted any modicum of balance, I couldn't just ignore them). And their drivers still sucked. For multiple years, I and people in the lab played the game of installing every version of every ATI driver we could find in the hopes of finding one whose combinations of working bits lined up in just the right way that our code would run (and it changed, non-monotonically, with every new app and every new driver release). The hilarious thing was that the ATI driver developers were actually more responsive than any from other vendors. We'd report bugs and they'd faithfully reproduce and fix them. And then break something else in the new driver rev that included the fix. I took the obvious cue and used NVIDIA hardware whenever it wasn't absolutely impossible.
Anyhow, as I've wandered away from daily GPGPU programming, my requirements for graphics cards have dropped even lower. I just want multi-mon and basic OpenGL to work. I needed a graphics card for a new machine and we had a big stack of ATI boards lying around (in nice shiny boxes, even) and no NVIDIA cards. This should have been a hint, but I hoped for the best. Ubuntu installed whatever it wanted (the open source radeon drivers) and things appeared to work. Except that under closer scrutiny, GL was pretty flawed. Anything double buffered in GLUT never refreshed, basic line drawing was inordinately chunky and CPU heavy, anti-aliasing was woefully broken / not present, etc. I shrugged and chalked up another victory for ATI.
At SIGGRAPH, though, Mike was bragging about how the ATI driver people have ground-up rebuilt their OpenGL drivers and totally rounded the corner on quality and also had their own proprietary Linux drivers. I was highly dubious, but decided to take him at his word. So, first I asked Ubuntu to install the proprietary (fglrx) drivers. That was exciting-- it produced an X server that crashed on startup and left my display unusable. Logging in remotely, I discovered that I had to manually uninstall the open source radeon drivers to avoid some conflict that apparently was beyond the ability of the maintainers to express with the packaging system (the same packaging system that somehow handles all the rest of Ubuntu, including the proprietary NVIDIA drivers, without problems). Hooray. Except that it didn't work. The driver didn't automatically claim my card and when I explicitly told X to use the fglrx driver, the driver found no supported devices. /usr/bin/aticonfig told me the same thing. Now, this isn't some shady engineering sample board we had lying around the lab. It came out of a shiny, glossy, official ATI-branded box and lspci identifies it as "ATI Technologies Inc R520GL [FireGL V7200]".
But, whatever. I uninstalled the various Ubuntu packages and went to the ATI/AMD website. I told it I had a FireGL V7200 and it told me it had just what I needed. I'm somewhat dubious of installing core modules outside the management of the packaging system because if (who am I kidding, when) it fails, cleanup will be messy. Luckily, I don't have to worry about it. Because the ATI installer doesn't get anywhere near that far. Just after decompressing the driver, it tells me:
```
   Error: ./default_policy.sh does not support version
   default:v2:i686:lib::none:2.6.28-14-server; make sure that the version is
   being correctly set by --iscurrentdistro
```
Right, then. I actually have a reasonable idea of what that means, but what I don't have is any motivation to try and make it work. I mean, seriously, NVIDIA solved the binary driver problem years ago. It went through some teething and rough times, but now it just works. And when it doesn't just work, it just says that it failed or even offers a reason why. I mean, I'm at least as fond of broken error message spew from careless shell scripts as anyone else alive (or at least anyone within five years of my age), but that's just embarrassing.
So, in summary, it's back to BrokenGL courtesy of the open source driver, my long-standing prejudices have a bit more anecdotal grist for their mill, and I've got my eye out for spare NVIDIA hardware. I wonder if Larrabee will be this much fun.

24 June 2009:

There was a message on the answering machine in our office when I arrived this morning. It was a recorded / automated call from SIGGRAPH urging us to register and attend. Seriously?! As Andrew said, this makes me not want to go.

11 June 2009:

Lots of slideware last week. With malice aforethought (or at least conscious aforethought, which is somewhat redundant), I volunteered to fill a FLASHG spot last week and built my PPL poster on the same content: GRAMPS shader extensions to enable reductions (the more developed version of some thoughts I expressed before about combine shaders, amplified in priority by how overwhelmingly common the associative reduction case is). Slides: (ppt) (pdf). Poster: (ppt) (pdf).

15 May 2009:

A while ago, IEEE Micro reprinted the Larrabee paper. Today, I found a thick cardboard envelope in my mailbox that contained a framed / laminated "Certificate of Appreciation" with my name on it "in recognition of the inclusion of" that paper. It's a nice gesture, but feels like much ado about nothing. The timing (i.e., lapse of two months since receiving the journals) is also odd. This makes me wonder if something similar is going to happen with the GPU virtualization paper reprint in OS Review.

17 April 2009:

Argh. The GRAMPS / XPU loader or linker screws up the bss. If I declare an uninitialized global variable, it shows up with an address of 0x0 at runtime. If I initialize it with anything (including garbage) then it shows up with a valid address. Ah well, good to know.
Here's a bad idea. Create a file, for example, sigHandler.cpp. Near the top of the file, add a line #include-ing the same file, rather than its header file (e.g., #include "sigHandler.cpp"). Then try to compile it and stare at the screenful of lines that repeat the filename before ending with "#include nested too deeply". Oops.

14 April 2009:

The final revised version of our GPU virtualization paper is now ready. (Update: 15 May 2009: Updated with the final final version incorporating final reviewer feedback).

13 April 2009:

Back in February, Micah and I found out that, "Extended versions of the best papers from WIOV '08 and VEE '09 will be published in a special issue of the ACM SIGOPS Operating Systems Review journal," and our paper had been selected for inclusion. Naturally, despite high aspirations for what to fix in an "extended version", reality has intervened. Despite that, Micah has made a number of targeted, but valuable, changes and we nearly have a final version (the deadline is Wednesday, so we'd better). In correlated, but not directly causally associated news, Micah has persevered (largely on his own aside from moral support / nagging) in navigating the legal byways to post a VMware SVGA Device Developer Kit on sourceforge with documentation, tests, examples, etc. Hopefully we will shortly add the microbenchmarks from the paper to the package.

10 April 2009:

The world makes sense again. Three unfortunate circumstances conspired against understanding. The code is indeed deadlocking, but the deadlock detector short-circuits, and thus fails, in a way that was foreseen as flawed ages ago:
```
   // FIXME(kayvonf): with finite size push queues or loops, inspect
   // bits stay on and we could deadlock
   if (grGlobals.sched.stageMaskInspect.Any())
     return;
```
The second circumstance is that the code only produces visualization information on scheduling changes. A deadlocked application never changes so no further records are logged as soon as things deadlock. The only giveaway is that the cycle count reported by grampsviz stops increasing and doesn't match the text in the log. Finally, there appears to be some element of chance, or at least unpredictability, in which inputs deadlock with an inspect bit on and which don't. Hence the proper detection of deadlock on other images. Now I just have to figure out what, if anything, to do about it (the only possibility is to discover / report better errors, so there's limited upside in spending too long fussing. I do like to fuss, though. And I dislike pathological failure handling.)
The good news, of a sort, is that my machine appears to be immune to Kayvon-esque log file time bombs. It kills any process that tries to grow a file beyond two gigs. Another debt I owe Mike for letting him sucker me into CentOS.
Of course, it actually is sort of good news because the need for a huge log file was a symptom of a deeper flaw. I handed my histogram app a 512x512 image and then went and patiently did other things while it ran. Only it ran and ran. And ran. And ran. And eventually got killed for logging more than two gigs. Sometimes I don't think very deeply when in scientific method mode, so I responded by turning off logging and assuming the logging was also slowing things down. After rather a while I gave up on that and tried a 256x256 image. After rather a while my brain actually engaged and I took a step back. I timed the 64x64 processing (barely more than a second) and decided that there was no way the slowdown I was seeing at 256x256 (only 16x the data) was remotely reasonable. I looked at the grampsviz output from a run I'd killed prematurely and realized that one of the subqueues was actually filling (presumably the one for zero or 256 since those seem to be the two big spikes in the images I have at hand).
Now, in and of itself, a subqueue filling is a problem, but a problem that should produce deadlock and catch the notice of the deadlock detector. I became suspicious, ratcheted the queue sizes way down (10 slots per subqueue/bucket), and tried the tiny images again. Happily and sadly, the deadlock detector promptly fired. In proper Brownian motion debugging, I upped the queue sizes to 100. Still deadlock (160 of the 256 pixels in the 16x16 image are dark enough to quantize to a luma of zero). But, then I tried the 64x64 image and discovered that I had successfully coaxed it into reproducing the bad state the larger images were showing with larger bin sizes. Now I just have to figure out what's going on. I'm sure (or at least, strongly suspect) something is fishy in the accounting for shader instances that block on push, potentially in conjunction with queue sets or dynamically instanced queue sets. I probably also want a distinct indication in grampsviz for shader instances that are blocked, but tying up a thread slot.

9 April 2009:

Hmm. I gave my PPL talk on where I am with GRAMPS and Map-Reduce with just an outline and whiteboards. I'd rate it a mixed success. There was a healthy amount of discussion (it helped that at least one visitor was unfamiliar with GRAMPS and her questions prompted more core GRAMPS motivation / suitability questions during the background) and I think the lack of slides helped with that. There were one or two visual aids where my hand sketching was clearly inferior in ways that affected comprehension, though.
Alex asked a good question: What is Map-Reduce bad at? Map-Reduce is here to stay, so it's valuable to delineate both what it handles readily and for what it is ill-suited. Kunle offered matrix multiply as a weakness. I only commented about graphics, since those are the only apps I would say I know well enough. D3D/OpenGL style pipelines are going to look appealing from a high level, but be awful quagmires without some extensions to enforce ordering. Order pervades them at too many fundamental levels and tagging everything with a sequence number, buffering, and reordering at the end is anathema to how they get performance. REYES, on the other hand, seems nicely suited. In all cases, the current implicit split and partition phases in Map-Reduce align with the graphics fixed function / non-data parallel stages and I think it's possible a good implementation would want something more stateful than a simple function callback, but I'm not sure.
In other news, I stumped Amazon. A search for 'paperback bookcase' returns no furniture. Indeed, it returns nothing other than books. Bookcases optimized for the storage of paperbacks seemed like an obvious niche to me (and a wider search online produces some hits), but apparently none of the Amazon vendors agree.

6 April 2009:

It's so nice outside that I just want to wander or sit around and let everything pass me by. 'Tis not to be, however. Instead, I'm starting to wonder if it makes sense to add a formal 'filter' type of stage or stage qualifier to GRAMPS. Now that basic map-reduce functionality works, I've been fuzzily probing all the areas that may require more thought in the face of any real scrutiny or challenge. It seemed like an easy but rewarding task to write an alternate histogram implementation that replaced the reduce kernel with a combine kernel (summing integers is a problem highly sympathetic to partial reduction).
Frustratingly, lots of little nuisances arose in what seemed like a minor task. First, I wanted combine to be a shader since it's logically as lightweight and stateless as any shader instance (the whole appeal of the GPGPU parallel reduction model). However, partial reduction inherently involves consuming two inputs and the current shader instance environment is entirely organized around processing single elements. I double checked and reminded myself that the combine shader sanity test I'd written had cheated and used a buffer for its accumulation of partial results. With a sparse and dynamic key space, that doesn't seem as kosher.
This didn't (doesn't) seem insurmountable, but prompted me to try to write a thread stage version of combine first. The code was straightforward: bind the same queue as input and output, reserve two packets on the input side and one on the output side, combine the two inputs into the output, and commit. Unfortunately, this promptly deadlocked. It's not exactly a real deadlock, but it's close enough to be trouble. The intermediate-pairs queue now has two producers: map and combine. Additionally, combine creates a cycle since it also consumes from intermediate-pairs. The semantic combine really wants is, "read two at a time from intermediate-pairs as long as map is running, then get a short read once there's only one packet left and map is done." That's a little too nuanced for the current deadlock detector, though. Instead, combine eventually blocks on a two packet input reservation, map is done, and the system deadlocks. I suspect a reasonable general rule is that a stage that produces to a queue it also consumes should still receive NO_MORE if all its upstream producers complete, but I haven't thought it through in detail yet.
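A toy model of that proposed rule, for a single key and with an in-memory array standing in for the intermediate-pairs queue. None of these names are GRAMPS API; ToyQueue, CombineStep(), and the StepResult values are all invented for illustration, and the real reserve / commit machinery is reduced to array indexing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Toy queue of partial sums for one key; not the GRAMPS queue API. */
typedef struct { int data[QUEUE_CAP]; size_t count; } ToyQueue;

typedef enum { STEP_COMBINED, STEP_BLOCKED, STEP_NO_MORE } StepResult;

void
ToyPush(ToyQueue *q, int v)
{
   assert(q->count < QUEUE_CAP);
   q->data[q->count++] = v;
}

/*
 * One iteration of a combine stage whose input and output are the same
 * queue.  'mapDone' says whether every *other* producer has finished.
 */
StepResult
CombineStep(ToyQueue *q, bool mapDone)
{
   if (q->count >= 2) {
      /* Reserve two packets, combine them, commit one back. */
      int a = q->data[--q->count];
      int b = q->data[--q->count];
      ToyPush(q, a + b);
      return STEP_COMBINED;
   }
   /*
    * Fewer than two packets.  Under the proposed rule, combine's own
    * output does not count as a live producer, so once map is done this
    * is a short read (NO_MORE) rather than a permanent block.
    */
   return mapDone ? STEP_NO_MORE : STEP_BLOCKED;
}
```

The interesting case is the last branch: with one packet left and map finished, a detector that counts combine as a producer for its own input would wait forever, whereas this rule ends the stage with the final partial result still in the queue.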
Anyhow, these cascading nuisances have prompted me to wonder if the right solution is a formal filter/partial-reduction shader stage type. In conjunction with queue sets, I'm pretty sure I can fake it with instanced thread stages that accumulate their partial results internally rather than constantly returning them to the intermediate-pairs queue, but that precludes creating multiple parallel instances that all handle the same subqueue for e.g., a logarithmic reduction.
Update: Indeed, faking it with instanced thread stages that keep the partial result on their stack works correctly. No surprise.
In loosely related news, another of the areas I've subjected to fuzzy probing is the push coalescing / utilization schemes. Map-reduce is deeply push / vout / coalesce rooted. Map, especially, generates intermediate pairs with no expected coherence in terms of spatial or temporal proximity of keys. The current GRAMPS push coalescing has a single local push compaction buffer per-queue and has to flush any time an instance pushes to a new subqueue. This means that in practice setting the packet size of the intermediate-pairs queue to N elements results in slightly higher than 1/N utilization during histogramming (it depends on the input image. Images with large regions of pixels of similar intensity do better).
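The 1/N figure can be reproduced with a small model of a single compaction buffer that flushes on every subqueue change. This is a simplification under stated assumptions, not the GRAMPS implementation: keys stand in for subqueues, and CountPackets() and Utilization() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the packets committed when a single compaction buffer of
 * 'packetSize' elements must flush whenever the destination subqueue
 * (here: the key) changes, or when it fills.  A simplified model, not
 * the GRAMPS implementation.
 */
size_t
CountPackets(const int *keys, size_t numKeys, size_t packetSize)
{
   size_t packets = 0;
   size_t fill = 0;   /* elements currently in the compaction buffer */

   assert(packetSize > 0);
   for (size_t i = 0; i < numKeys; i++) {
      if (fill > 0 && (keys[i] != keys[i - 1] || fill == packetSize)) {
         packets++;   /* flush a partially (or completely) full packet */
         fill = 0;
      }
      fill++;
   }
   if (fill > 0) {
      packets++;      /* final flush */
   }
   return packets;
}

/* Fraction of committed packet slots that hold real elements. */
double
Utilization(const int *keys, size_t numKeys, size_t packetSize)
{
   size_t packets = CountPackets(keys, numKeys, packetSize);
   return packets == 0 ? 0.0
                       : (double)numKeys / (double)(packets * packetSize);
}
```

Keys that change on every element give exactly 1/N utilization; a run of identical keys fills every packet. Real images fall in between, which matches the "slightly higher than 1/N" observation.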
My preliminary thinking is that rather than compacting in local store and copying into the queue, there are at least some situations where the push compaction should just be done in-place in each subqueue and the dispatch thread should maintain an array of windows per subqueue (or cache of windows per some large number of subqueues). It absolutely makes sense in the case of reduce, where the downstream isn't going to run until all data is available, so there's no pressure to get data committed early. I think it makes sense more generally whenever pushing to an unordered queue. It may cause fragmentation in the internal queue implementation, but since there's no head-of-line blocking in unordered queues, the increase in packet utilization / density should be the significant one.
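A rough model of the in-place alternative, assuming one open window per subqueue that is committed only when full or at the very end. As before, keys stand in for subqueues; CountPacketsInPlace() and MAX_SUBQUEUES are invented for the sketch and it ignores fragmentation in the queue storage entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUBQUEUES 256

/*
 * Packets committed when each subqueue keeps its own open window and
 * only commits a packet once it is full (or at the final flush).  Keys
 * must be in [0, MAX_SUBQUEUES).  A simplified model of the idea, not
 * GRAMPS code.
 */
size_t
CountPacketsInPlace(const int *keys, size_t numKeys, size_t packetSize)
{
   size_t fill[MAX_SUBQUEUES] = { 0 };
   size_t packets = 0;

   assert(packetSize > 0);
   for (size_t i = 0; i < numKeys; i++) {
      assert(keys[i] >= 0 && keys[i] < MAX_SUBQUEUES);
      if (++fill[keys[i]] == packetSize) {
         packets++;            /* window full: commit it */
         fill[keys[i]] = 0;
      }
   }
   for (size_t k = 0; k < MAX_SUBQUEUES; k++) {
      if (fill[k] > 0) {
         packets++;            /* final flush of a partial window */
      }
   }
   return packets;
}
```

Eight elements alternating between two keys with a packet size of four come out as two full packets here, so the packet count depends only on how many elements each key receives, not on the order in which they arrive.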
I would just like to record for posterity that if one is foolish enough to run an arbitrary program that Kayvon gives him, and foolish enough to run it using the shared home directories instead of a local disk, and it happens to generate a multiple-hundred gigabyte log file and fill the combined quota for all of Pat's students, then one will apparently get ZFS in such a bad state that it won't let one (or anyone else) remove said log file (or anything else) and thus hamstring any of Pat's other students foolish enough to depend on shared storage. For extra credit, one can arrange for all of this to unfold while John is driving between Santa Barbara and here and thus unavailable.
There are many morals to this tale, but honestly, the schadenfreude factor is really swamping them all for me. I've just smiled as the various confused, frantic, and resigned emails have gone around. Well, and I tried to help Solomon a little, but quickly established that there wasn't much to be done. For my part, I didn't even notice any hiccups, though I suppose I would have if I'd tried to write this earlier or check in outstanding changes.

3 April 2009:

Change 1000 for the GRAMPS tree. And what a change it was:

   ------------------------------------------------------------------------
   r1000 | yoel | 2009-04-03 16:40:43 -0700 (Fri, 03 Apr 2009) | 27 lines

   Make visualization of dynamic instancing actually work.

   All my prior changes to dump and visualize workloads with dynamic instancing
   were braindead.  In their defense, they didn't segfault and they worked if
   the graph only had a single queue.  That turns out not to be terribly
   helpful, though.  Now I've added enough information to each VizRecord to
   determine the correct starting offset of each queue in the global list of
   subqueues and fixed up the surrounding code to honour it.  The grampsviz
   images for histogram make sense and provide value now.

   ...

Oops.

Sometimes I really hate C++. That is all.

2 April 2009:

A high quality useless technical exchange. I realized that in order to histogram images, I needed to linearize the pixels (or else produce separate r, g, b, a histograms, but that was unappealing). I immediately resorted to my default: the perceptually weighted total of r, g, and b. That ignored alpha, though. While this didn't seriously bother me, I gave it a little thought and considered scaling the total by alpha. Andrew pointed out that this just dilutes the result as if there were a black background and that truly the appropriate thing to do would be to tally a fractional contribution of alpha towards the appropriate histogram bucket. It's a good point, but I'm sticking with integers. In other news, I think my map-reduce histogram actually finally works end to end now.
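For concreteness, an integer-only sketch of a perceptually weighted total. The Rec. 601 weights (0.299, 0.587, 0.114) are an assumption, since the entry doesn't say which weights were used, and Luma() is a hypothetical name. Alpha is ignored, as in the default described above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Integer luma in [0, 255] from 8-bit r, g, b.  The Rec. 601 weights
 * (0.299, 0.587, 0.114) are an assumption; alpha is ignored.  The +500
 * rounds to nearest instead of truncating.
 */
unsigned
Luma(uint8_t r, uint8_t g, uint8_t b)
{
   return (299u * r + 587u * g + 114u * b + 500u) / 1000u;
}
```

Because the weights sum to 1000, pure white maps to 255 and pure black to 0, so the result can index a 256-bucket histogram directly.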
Uh-oh, the sky is falling. One of my officemates taped some junk mail to the ceiling (he's building a camera and the apparatus is precarious enough that the lens has to rest on the circuitry pointing straight up). After a week or two, however, the tape has begun peeling and the IEEE Benefits of Membership flier is now hanging precariously from one corner. Update: The sky has fallen. He jumped up and pulled the whole thing down.
In other high quality facilities news, GCafe stalled for a good five minutes this afternoon. The reconstructed story: the building carpets are being cleaned in the evenings this week. Last night they cleaned the third floor. For highly dubious reasons, the same circuit the janitors use to power their equipment happens to be the one that drives the plasma television and projector in the graphics lab conference room. Shampooing the carpets blew the circuit breaker (which it apparently does routinely), the janitors moved to another plug (also apparently routine), and we were stuck figuring out what had happened and then tracking down the breaker panel in question (over next to the water fountains, nowhere near the conference room). Or rather, John was stuck doing that while we gave up and used a portable projector someone had.

30 March 2009:

I gave in and put a split stage above map. I suspect there's a serious shake down coming to reconcile the packet and element types in all the queues once I can actually exercise the code with real data. I've chosen a simple image histogram app to bring up map-reduce (split chops image into n-pixel chunks, map emits pairs of the form (quantized pixel value, 1), and reduce (or combine) sums the columns. It's just word-count in disguise). However, I've gone about as far as I can without delving into the implementation of push, so I need to stop tiptoeing around it and stare at the code long enough to figure out how it expects to be used.
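The shape of that histogram, written as a serial sketch in plain C. It only illustrates the split / map / reduce roles: a fixed array stands in for the intermediate-pairs queue, the input is assumed to be already quantized to 8 bits, and every name (Pair, MapChunk, ReducePairs, Histogram) is made up rather than taken from GRAMPS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 256
#define MAX_CHUNK   64

typedef struct { uint8_t key; unsigned value; } Pair;

/* Map: one (quantized pixel value, 1) pair per input pixel. */
size_t
MapChunk(const uint8_t *pixels, size_t n, Pair *out)
{
   for (size_t i = 0; i < n; i++) {
      out[i].key = pixels[i];
      out[i].value = 1;
   }
   return n;
}

/* Reduce: sum the values for each key into its histogram bucket. */
void
ReducePairs(const Pair *pairs, size_t n, unsigned hist[NUM_BUCKETS])
{
   for (size_t i = 0; i < n; i++) {
      hist[pairs[i].key] += pairs[i].value;
   }
}

/* Split: chop the image into chunks of at most chunkSize pixels. */
void
Histogram(const uint8_t *image, size_t numPixels, size_t chunkSize,
          unsigned hist[NUM_BUCKETS])
{
   Pair pairs[MAX_CHUNK];   /* stand-in for the intermediate-pairs queue */

   assert(chunkSize > 0 && chunkSize <= MAX_CHUNK);
   memset(hist, 0, NUM_BUCKETS * sizeof hist[0]);
   for (size_t start = 0; start < numPixels; start += chunkSize) {
      size_t left = numPixels - start;
      size_t n = left < chunkSize ? left : chunkSize;
      size_t emitted = MapChunk(image + start, n, pairs);
      ReducePairs(pairs, emitted, hist);
   }
}
```

Swap pixels for words and buckets for a dictionary and this is word-count, which is the point of the comparison.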
By the way, a couple of weeks ago, I found something completely unexpected in my mailbox here: a handwritten and thoughtful thank-you letter for my trip to Davis. It was an eminently classy touch that I think deserves credit and acknowledgement (though, at some point mutual appreciation becomes absurdly recursive).

23 March 2009:

Hindsight obvious: A shader stage at the top of a pipeline / graph doesn't make any sense. Without a queue from which to pull (or a stream to process in more standard parlance), there is no way to instance.

16 March 2009:

It's journal day! I was surprised to find two sizeable packages in my mailbox at Gates, today. One held three copies of IEEE Micro's "Top Picks from the 2008 Computer Architecture Conferences" issue and one held five copies of the January, 2009 issue of TOG. I was expecting the copies of TOG (one for each author now that the GRAMPS SIGGRAPH submission has finally transubstantiated into publication), but IEEE Micro was a surprise. They've included the Larrabee paper (which I knew) and apparently that qualifies me for three hard copies (which I didn't know). Pat got three too, but asked to take mine off my hands since people keep taking his copies. Kayvon immediately complained (again) that the ACM people "totally messed up the fonts" in the GRAMPS figures. Naturally, I can't really tell.

9 March 2009:

It's always satisfying when someone else nails a pet issue of yours. Joel Spolsky, while describing how to be a successful program manager, offered this exchange:
```
   Here's something that has happened several times: a programmer asks me to
   intervene in some debate he is having with a program manager.

   "Who is going to write the code?" I asked.

   "I am..."

   "OK, who checks things into source control?"

   "Me, I guess, ..."

   "So what's the problem, exactly?" I asked. "You have absolute control over
   the state of each and every bit in the final product. What else do you
   need? A tiara?"

   You see, it turns out that this system puts the burden on the program
   manager to persuade the programmer, because at some point, the program
   manager runs the risk that the programmer will give up and just do
   whatever the heck the programmer feels like.
```
I encounter this dysfunction / dissatisfaction with shocking regularity. Programmers, senior well-respected programmers, express feelings of powerlessness in the face of product management, program management, regular vanilla management, etc. I always challenge them back that such powerlessness is only possible if they abdicate their inherent last word on product development. Fundamentally, engineers take their cues and strongest influence from their peers and leads. Those are the people surrounding them all day for advice, discussion, code reviews, mentoring, etc. They may have a weekly one on one with their managers and see him or her in a group meeting or two, but that's an order of magnitude less contact and social influence. Product management, program management, etc. are all even further away. Moreover, the whole reason "the organization" gives engineers fancy titles (tech lead, architect, distinguished engineer, staff engineer, principal engineer, etc.) is to convey both a recognition and trust of their judgement as well as an expectation that they will use it (I'll note that some, but definitely not all of the time, when I approach the managers, PMs, etc. about the flip side of such situations, they express an eagerness for the engineers to step up and show more leadership. It's the absence of that leadership that has them knowingly doing an uncomfortable job covering rather than leaving a total vacuum). I think I'll start incorporating, "What else do you need? A tiara?" in my future exhortations to dismayed engineers that they can empower themselves as simply as ceasing to abdicate their influence.

6 March 2009:

Mmm, syndication. When he was here a month ago, herr jrk observed the lack of any feed or other automatic mechanism for discovering updates. A year or more ago, I'd made the same observation and dug into RSS to see if I could easily accommodate it without major nuisance. I came away disgusted with the crudity and inelegance of 'syndication' schemes and relegated the problem to the back of my mind. Jonathan's reminder nudged the priority upward some more and I've tentatively reached rapprochement with Atom. Hopefully I'll remember to update the file when I add notes and hopefully the feed will be marginally useful even though there are no timestamps or summaries (the titles serve as reasonable timestamps for human consumption, but probably won't help consuming software).
Made it to Davis without any trouble or inclement weather. I made the trip home more exciting than it needed to be, but luckily that was after everything was over. I got to hear what various folks up there are doing and the lecture went okay. Pretty passive crowd. At least one person followed well enough to ask a significant question at the end, though. And that's despite the ghastly blurring the projector produced in the face of all efforts to persuade it otherwise. Here's the slides in non-blurred form: (ppt) (pdf). If I were to make revisions, I think I should flip the order to be 'picture version' before 'text version' for concepts that are presented both ways and I need to cycle back in some discussion devoted to scheduling, either in the design or evaluation segments.

27 February 2009:

Oh, to be wanted! I received an email this afternoon that opened:
```
   I want to first apologize for my intrusion. My name is ---- and I
   am a Staffing Specialist with NVIDIA Incorporated (Nasdaq: NVDA). I'm
   working with Senior Directors of one of NVIDIA's most visible and
   prestigious teams.

   We are looking for a
   CUDA DEVTECH ENGINEER #1136737
   ...
```
I actually receive so few unsolicited recruiting emails that I still paid enough attention to judge this a pretty lame effort. There were clearly manual steps to this discovery / screening process that produced my name and the best he can do is devtech? Without even mentioning the other CUDA related openings NVIDIA's web page lists? I guess he only gets paid for hiring devtech, no matter how unlikely the fit. And that's seriously all the cover letter / introduction he can spare? It actually makes the email less appealing than if it were omitted entirely. The NVIDIA marketing youtube links in his signature are a nice touch, though.

26 February 2009:

In other, more productive news (well, sort of), I spoke at GCafe today. I used the obligation as an excuse to continue refining my proposal / story for GRAMPS and gave a talk on the background and goals for, key design decisions in, and planned metrics / deliverables for GRAMPS (ppt) (pdf). I think I did something wrong, though, because the audience was mostly friendly and fairly quiet. I guess graphics people are less willing to challenge my sweeping characterizations of how to confront commodity, heterogeneous, many-core hardware than other systems people. Now I just need to put together my lecture for next week's visit to Davis and I'm once more low on excuses to avoid doing real work. Well, I suppose there is the minor need to continue iterating with Pat to nail down the GRAMPS proposal / story and intended execution.
Sigh. I wanted to write a simple MapReduce program in perl only to discover that it's very hard to find a MapReduce perl package*. This has to mean perl is officially uncool. So I gave in and wrote one. It wasn't that hard (I just wanted the syntactic sugar, not an efficient parallel implementation).
*Technically, CPAN has a Parallel-MapReduce, but like many fledgling efforts it manages to be both massively over-engineered / complex and not actually able to express the usage I want. Alas.

10 February 2009:

The January 2009 issue of TOG has made it to light and we're official! It's amusing to note that the article says it was received June 2008 and accepted August 2008. I wonder what limbo it was in transitioning from SIGGRAPH to TOG before June.

6 February 2009:

And now I have initial rough support for (statically) instanced thread stages. I should associate instances explicitly with input bins (subqueues) and be smarter about which instances tier 1 tries to schedule, but suitably defensively written code works fine now. Next should probably be dynamically instanced subqueues, and hence threads.

5 February 2009:

Writing code these days is correlating inversely with writing prose. I don't know if it's the nature of the code, the lapse of the prose writing habit, or time pressure, but it's too bad. I had a good hour plus conversation with Jonathan yesterday about his work, vision / roadmap for GRAMPS, and more generally how GPUs and graphics pipelines should trend and what's important to figure out in pursuit of that. I also gave a FLASHG presentation about thoughts and plans for GRAMPS (pdf), though it looks a lot like the presentation from the PPL retreat. That seemed like a better organizational framework than a status report of what I have implemented and am planning to implement in code-oriented milestones (which would have made sense to few people and interested even fewer). It would probably be good discipline for me to write such a list, though. It would help remind me to be less distracted with open ended refactoring and improving of the code.
Anyhow, after the PPL meeting today, we again tilted at the windmill of parallel taxonomies and required building blocks. I frequently complain that producer-consumer gets overlooked when the parallelism in an application is superficially reduced to what's "task" and what's "data" parallel and that building blocks for the producer-consumer case are a requirement for any lowest level parallelism substrate so just talking about tasks and vectors misses out. Pat justifiably complained that producer-consumer is orthogonal to the abstract academic definitions of task and data parallelism and has prompted me to reevaluate my phrasing once again. Here's what I ultimately wrote:
I suppose I'd rather take a step back from a single task/data dichotomy and say that the taxonomy of significant parallel primitives includes four things:
- A type of kernel that can be called 'data parallel' that takes as input a vector, expected to be relatively long, of records and can process all of them with the assumption that there are no dependencies between records. I personally allow this kernel unordered variable outputs or ordered statically fixed outputs, but the strictest definition only permits the latter.
- A type of kernel that can be called 'task' or 'thread oriented' that takes as input arbitrarily structured data and generates whatever output it wants in whatever order it wants. In many systems, 'tasks' are allowed to create/run new tasks and even new data parallel kernels dynamically at run-time.
- A mechanism for specifying dependencies between task or data parallel kernel invocations such that the 'system' is guaranteed not to start a kernel until all input dependencies have been satisfied. In its most basic form, this is an implicit barrier embodied in the main / master loop of the program as it explicitly creates tasks, but explicit barriers and explicit dependency frameworks are also fairly common.
- A mechanism for allowing producer-consumer data exchange between two tasks, two data parallel kernels, or either mixture of the two. Specifically, a pipelining mechanism where the consumer can be started before its entire input stream is generated and incrementally request / consume more, and where a producer can create / push output without terminating, with flow control / blocking that is explicitly controlled by the application or implicitly handled by 'the system'.
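To make the fourth item concrete, here is a minimal single-threaded sketch of a bounded FIFO between one producer and one consumer. The names (`Ring`, `ring_push`, `ring_pop`, `pipeline_sum`) and the capacity are assumptions made up for illustration, not the GRAMPS API. A failed push stands in for the producer blocking, and the consumer starts draining long before the producer has generated its whole stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-capacity FIFO; illustrative only, not the GRAMPS API. */
#define RING_CAP 4

typedef struct {
    int    slots[RING_CAP];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of elements currently queued */
} Ring;

/* Producer side: returns 1 on success, 0 if the queue is full.
 * A full queue is the flow control: the producer has to back off. */
static int ring_push(Ring *q, int value)
{
    if (q->count == RING_CAP)
        return 0;
    q->slots[(q->head + q->count) % RING_CAP] = value;
    q->count++;
    return 1;
}

/* Consumer side: returns 1 and stores the oldest element in *out,
 * or returns 0 if nothing is available yet. */
static int ring_pop(Ring *q, int *out)
{
    assert(out != NULL);
    if (q->count == 0)
        return 0;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % RING_CAP;
    q->count--;
    return 1;
}

/* Interleave a producer generating 0..n-1 with a consumer summing them.
 * At most RING_CAP intermediate elements exist at any moment. */
static long pipeline_sum(int n)
{
    Ring q = { {0}, 0, 0 };
    long sum = 0;
    int next = 0, v;

    while (next < n || q.count > 0) {
        while (next < n && ring_push(&q, next))  /* produce until blocked */
            next++;
        if (ring_pop(&q, &v))                    /* consume one element */
            sum += v;
    }
    return sum;
}
```

The point of the sketch is the storage bound: however large `n` gets, the intermediate stream never occupies more than `RING_CAP` slots, which a creation-time dependency between the two kernels cannot express.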
I've also drawn a few generalizations about how things seem to work out 'in practice':
- The producer-consumer element of the taxonomy is often dropped in parallel programming models and toolkits. It is indeed entirely orthogonal and interesting programs can be built without it. However, a creation-time dependency scheme alone is insufficient to efficiently express and execute some producer-consumer workloads.
- I've observed two distinct ways that parallel workloads can behave. There may well be more in apps I haven't yet studied. The first is exemplified by the canonical grid-based scientific simulation: the brunt of computation is on extremely long lived data that persists in-place in memory while ephemeral tasks and data parallel kernels manipulate it. Often the central data structure is a grid, but it can just as easily be a tree or anything else. The key feature is that very little intermediate data is generated that persists beyond a single task / kernel, but is ultimately consumed. The second is exemplified by the graphics pipeline: the inputs and framebuffer (outputs) are long lived, but vast amounts of intermediate data are generated and entirely consumed in between. Moreover, the bandwidth costs of spilling this data off-chip are prohibitive, so it cannot be accumulated at each stage of the pipeline in its entirety and then advanced to the next stage / pass. In other applications of this type, the total storage cost can be as much of a problem as bandwidth (consider e.g., the amount of tessellated geometry and intermediate samples in a high resolution ray tracer or film renderer).
- Locks aren't actually good primitives to expose. They can be used as tools to build good primitives, but your average parallel application author should never have to directly manipulate one. In the in-place long lived data workloads, the data is generally partitioned with the communication along the interface mediated by dependencies / barriers rather than a cascade of individual locks. In the pipeline schemes, the work queues themselves provide the brunt of the synchronization, often in conjunction with binning / partitioning schemes like those that handle the interfaces for the in-place workloads. The same can be seen in MapReduce where locking is implicit in a number of places-- buffering pairs, sorting by key, partitioning the input-- but never left to the user. Tim has described what he calls generational data structures that he says are common in games and which also avoid locks for synchronizing updates to shared data. Databases use transactions (and the transactional memory folks would have us all use them). Etc.
On a vaguely related note, I think I have finally wrapped my head around what Pat calls a data parallel domain specific language. The longstanding stumbling block has been my under appreciation of the role of the lambda functions / element-wise filters applied with every map, reduce, over, group-by, etc. operation and Pat's insistence on calling it data parallel. I see it as a domain specific language for any parallel program that can be expressed as a static computation graph (or expression tree if you perceive each stage as an operator the way Pat does). It's really the ultimate generalization of all computation graph based systems. Where streaming, at least as it has been adopted, only includes 'map' shaders and GRAMPS supports 'map + push' shaders and threads (the degenerate case of applying a trivial vector operator to the single length vector and putting all the work in the lambda / filter function), this language supports a whole spectrum of distinctly typed stages. I'm curious whether it's significantly more expressible than GRAMPS (or map-reduce, or any of a number of other systems that have consciously reduced the space). If it isn't, then it's really a poster-child domain specific language. The whole thing can be implemented on a more minimal set of primitives while exposing richer higher level primitives upwards. If it is, it would be interesting to know if there's a viable fast implementation of the missing part or whether it inherently doesn't map well even to the bare metal.

19 January 2009:

Alright, growing reservation windows works now! I'm also on much better terms with the runtime and queue bookkeeping code than before. On a more ironic note, I realized that I can use the short reservation support in conjunction with math on the window size to fulfill the same purpose that drove me to finally fix window resizing. Ah well.

16 January 2009:

It also turns out that deleting your sanity check assertions when they fail because you suspect they're too strict is a bad idea. Digging into what's going on enough to realize two of your callsites are subtly violating your expectations is a much better idea.

15 January 2009:

It turns out that being a hypocrite is a bad idea. Despite long espousing their merit, I only recently got around to writing a harness and some formal precheckin tests for GRAMPS. And, guess what-- it promptly revealed that I'd broken push queues in the CPU-like configuration a while back and failed to notice. Hooray for hubris. Now it's time to add a few more tests.

9 December 2008:

Things that are awesome (and completely non-technical): I wake up in the morning and discover that another governor of Illinois has been arrested, this time for trying to auction off a seat in the US Senate (in fact, his predecessor is still in jail serving time for racketeering and fraud). The Wall Street Journal provided a quotation from the FBI's news conference that made me laugh out loud: "If [Illinois] isn't the most corrupt state in the United States, it is one hell of a competitor." I feel such pride.

4 December 2008:

Darn that Kayvon! It looks to me like the code for growing a GRAMPS window had always been subtly broken, but since we've never exercised it, we've never noticed. If you request a reservation of size a, then grow it to a reservation of size b, almost everything works correctly, except that the innermost bookkeeping for the queue ends up recording a total reservation of a + b rather than discounting the new reservation by the already outstanding amount. Since I initially tried testing with small sizes, the problem was further masked (it was easy to attribute the off-by-one to my code. When I cranked up the data size, it was significantly harder to blame myself for the off-by-many).
The problem is minor. It's unsettling, however, to discover that extensively used and seemingly correct code is wrong. I would have taken a not-implemented in stride, but this manifestation has more of the "I found a compiler bug" feel where I become suspicious of everything and everyone. Luckily there are meds for that.
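The bookkeeping slip is easy to show in miniature. This is only a sketch: `QueueBooks`, `Window`, and the function names are hypothetical stand-ins, since the real GRAMPS structures aren't shown in these notes. It contrasts a grow that charges the full new size against one that charges only the difference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping types, not the real GRAMPS structures. */
typedef struct { size_t reserved; } QueueBooks;  /* total outstanding reservation */
typedef struct { size_t size; } Window;          /* one window's current reservation */

/* Initial reservation of size a. */
static void window_reserve(QueueBooks *q, Window *w, size_t a)
{
    w->size = a;
    q->reserved += a;
}

/* Buggy grow: charges the whole new size b on top of the a already
 * outstanding, so the queue records a + b. */
static void window_grow_buggy(QueueBooks *q, Window *w, size_t b)
{
    q->reserved += b;
    w->size = b;
}

/* Fixed grow: discount the new reservation by the outstanding amount. */
static void window_grow(QueueBooks *q, Window *w, size_t b)
{
    assert(b >= w->size);            /* this sketch only handles growth */
    q->reserved += b - w->size;
    w->size = b;
}
```

With a = 3 and b = 8, the fixed version leaves the queue at 8 reserved while the buggy one leaves it at 11, and the error scales with the reservation sizes, which is consistent with it being easy to overlook at small sizes.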

24 November 2008:

This is not the prettiest code I've ever written:

```
#define PRODUCEFN(fn) \
   static void fn(void); \
   void (*_MRProduceFn)(void) = fn; \
   void fn(void)

#define MAPFN(fn, input) \
   static void fn(const void *input); \
   void (*_MRMapFn)(const void *input) = fn; \
   void fn(const void *input)

#define COMBINEFN(fn, key, v1, v2) \
   static void fn(const void *key, const void *v1, const void *v2); \
   void (*_MRCombineFn)(const void *key, const void *v1, const void *v2) = fn; \
   void fn(const void *key, const void *v1, const void *v2)

#define REDUCEFN(fn, key, values, num) \
   static void fn(const void *key, const void *values, int num); \
   void (*_MRReduceFn)(const void *key, const void *values, int num) = fn; \
   void fn(const void *key, const void *values, int num)
```

Hopefully this means I'll think up a clever way to do it on the way home tonight. Otherwise I'll have to actually clean it up and live with it. And that doesn't count the weak symbol definitions I threw into the bottom of the header file to make it work out when an app doesn't define all the entry points.
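For the record, the weak symbol trick looks roughly like the sketch below. It assumes GCC or Clang (`__attribute__((weak))` is a compiler extension, not standard C), and the names `CombineFn`, `app_combine_fn`, `has_combiner`, and `call_combiner` are hypothetical rather than what the header actually contains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the real header's symbols are the _MR*Fn pointers. */
typedef void (*CombineFn)(const void *key, const void *v1, const void *v2);

/* Weak default. An application that provides its own (strong) definition
 * of app_combine_fn overrides this one at link time; otherwise the
 * pointer stays NULL. Requires GCC/Clang-style weak symbols. */
__attribute__((weak)) CombineFn app_combine_fn = NULL;

/* The runtime can test whether the optional entry point exists. */
static int has_combiner(void)
{
    return app_combine_fn != NULL;
}

/* Invoke the combiner; callers are expected to check has_combiner() first. */
static void call_combiner(const void *key, const void *v1, const void *v2)
{
    assert(has_combiner());
    app_combine_fn(key, v1, v2);
}
```

The runtime then skips the combine phase whenever `has_combiner()` is false instead of failing to link.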

21 November 2008:

Also at the PPL Retreat, I gave a GRAMPS talk (while Ian and Mike asked challenging and confrontational questions). It was good motivation to take a stab at articulating MapReduce and Cloth Sim and to start working on calling out producer-consumer parallelism as distinct and valuable above and beyond the more widely perceived strategies of task and data parallelism. I'm still coming to appreciate just how lacking it is in the body of conventional big iron / cluster scientific simulation and other apps and thus how under-appreciated / under-recognized it is in many parallel toolkits, languages, programming models, etc. It's so central to the air one breathes doing rendering that only recently have I realized how often it is overlooked elsewhere. I'm also biased about the applications I think are relevant for a mass market / commodity device to benefit from pervasive parallelism. Unsurprisingly, I think those apps involve a lot more producer-consumer parallelism than is learned looking at big parallel scientific simulations or Sequoia.
Ian had some good comments when we talked more after dinner and I've even post facto modified my slides a little in response. Mendel asked me if GRAMPS supported task parallelism and I realized that it fits really well. Task parallelism is just an execution graph where what's queued is activation records for sub-tasks rather than actual data. Similarly, data parallelism is just an execution graph where all that flows through each queue is a single NO_MORE/ALL_DONE as kernels complete. Pat liked my characterization of task-parallel as Divide and data-parallel as Conquer (which, by the way, I claim is truly accurate, not just a snappy rhetorical device).
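A toy rendering of the "queue of activation records" view, as a sketch under assumptions: `Task`, `queued_sum`, and the constants are invented for illustration, and a single sequential loop stands in for whatever would actually drain the queue in parallel. Divide splits a range and enqueues two sub-task records; Conquer runs a plain loop over a small range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical activation record meaning "sum data[lo, hi)". */
typedef struct { size_t lo, hi; } Task;

#define MAX_TASKS 64
#define LEAF_SIZE 4   /* ranges this small are summed directly */

/* Sum an array by draining a work list whose elements are sub-task
 * records rather than data.
 * Divide:  split a range and enqueue both halves.
 * Conquer: sum a small range with a simple element-wise loop. */
static long queued_sum(const int *data, size_t n)
{
    Task queue[MAX_TASKS];
    size_t count = 0;
    long total = 0;

    queue[count++] = (Task){ 0, n };
    while (count > 0) {
        Task t = queue[--count];             /* LIFO order keeps the list shallow */
        if (t.hi - t.lo <= LEAF_SIZE) {
            for (size_t i = t.lo; i < t.hi; i++)
                total += data[i];
        } else {
            size_t mid = t.lo + (t.hi - t.lo) / 2;
            assert(count + 2 <= MAX_TASKS);  /* sketch: fixed-size work list */
            queue[count++] = (Task){ t.lo, mid };
            queue[count++] = (Task){ mid, t.hi };
        }
    }
    return total;
}
```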
Someone from NVIDIA spoke at the PPL Retreat, and then a group of us discussed in more detail over lunch, about whether SIMT (NVIDIA's name for its architectural execution model) is materially different from a multi-thread SIMD vector unit where the SIMT warps align with vector threads and the SIMT intra-warp threads align with SIMD lanes. It was a slightly unsatisfying conversation, though. Not because we disagreed, which is understandable, but because I didn't feel like there was success progressing towards identifying where/why we disagreed.
I think we explicitly agreed that from the CUDA/source code perspective there was no way to tell the difference whether the compiler or hardware implemented the vector masks to deliver a scalar environment, but he also emphatically stated (1) that the two were fundamentally different and (2) they were disinclined to tell us why the two were different or offer any insights into choosing one design over the other. He stated that any further how or why details got into NVIDIA 'secret sauce' and it made no business sense to reveal anything further since it would help their competitors. I found that very surprising-- how do you persuade people of the merits of an architecture without complementing 'what' with 'why'? Well, perhaps not pure end users, but surely a community of designers and researchers examining a variety of choices not only to use, but to which to contribute, for which to build, etc.
My experience and instinct argue there should be a qualitatively meaningful middle ground for providing useful high level information without giving away competitive advantage. Is there truly a trade secret that is so simple, so central, and yet so unpatentable that protecting it prevents offering any insights? I wouldn't have expected that and it would make it very awkward to satisfy my honest and good faith intellectual curiosity about design choices and the corollary lessons to be learned.
(A note: SIMT does offer the potential for an implementation that can be discerned from threaded vector. There is no inherent limitation that only one program counter within a warp execute at once. As Kurt pointed out last year, one can easily conceive of a SIMT implementation that can dispatch multiple different instructions per cycle in the event of a divergent warp. It remains an interesting question whether such an implementation immediately trends towards as many independent instructions as threads in a warp or whether a warp still reflects an interesting execution coherence grouping.)
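For reference, this is the "threaded vector" reading of a divergent branch, emulated in plain C. It is a sketch of the masking idea only and makes no claim about how any real GPU implements it. `LANES`, `warp_step`, and `scalar_step` are made-up names. Both sides of the branch are issued in turn under complementary lane masks, and each lane ends up with exactly what a scalar thread would have computed.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8   /* hypothetical warp / vector width */

/* Scalar program each "thread" believes it is running:
 *     if (x is odd) x = 3*x + 1; else x = x / 2;
 * Executed the threaded-vector way: one instruction stream, both sides
 * of the branch issued in turn, each under a lane mask. */
static void warp_step(int x[LANES])
{
    uint32_t mask = 0;
    assert(LANES <= 32);                       /* the mask has one bit per lane */

    for (int lane = 0; lane < LANES; lane++)   /* evaluate the predicate per lane */
        if (x[lane] & 1)
            mask |= 1u << lane;

    for (int lane = 0; lane < LANES; lane++)   /* "then" side, active lanes only */
        if (mask & (1u << lane))
            x[lane] = 3 * x[lane] + 1;

    for (int lane = 0; lane < LANES; lane++)   /* "else" side, complementary mask */
        if (!(mask & (1u << lane)))
            x[lane] = x[lane] / 2;
}

/* Reference: what a single scalar thread computes. */
static int scalar_step(int x)
{
    return (x & 1) ? 3 * x + 1 : x / 2;
}
```

From the source-level view the two functions are indistinguishable per lane, which is the part we agreed on; the sketch says nothing about the part we didn't.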

11 November 2008:

Is CUDA more than just a fancy way of writing a parallel for() loop? Christos asked me that earlier and complained that people from NVIDIA always denied it, but never seemed to have an explanation of why it was untrue. I told him I thought it was a reasonable mental shorthand. In fairness though, or at least in theoretical nuance, CUDA is more than a basic parallel for() loop. It is SPMD, but there is no constraint against conditionals in the program. So, one could quite conceivably write a CUDA kernel where threads 0-31 all functioned as one stage in a software pipeline, threads 32-63 the next, etc. Well, except for the very rudimentary atomic and repacking support. In practice, there are a lot of implementation constraints that would make that awkward (any model that penalizes threads for using any stack or more than a handful of scalar registers tends to favour very basic code sequences such as simple loop bodies).
Now, obviously one could implement exactly the same thing with a parallel for() loop (use the loop index as the thread ID and paste in the same kernel body). I also think the resource footprint and practical communication constraints imposed by current GPU shader core architectures are a good match for the expectations embodied by the conversational shorthand, "basic parallel for() loop". However, I understand how an argument that CUDA is more than that can be defended on marginally technical grounds by a particularly dogmatic partisan.
Note that despite that, I think CUDA as a programming model has actually demonstrated itself a valuable contribution to commodity parallel programming. Despite my teasing and longstanding dismissiveness, I sincerely hope Intel implements a "scalarizing shader/kernel compiler" that runs CUDA-like kernels on Larrabee and NVIDIA, Intel, or someone makes a run-time available for out-of-order x86 cores, too. For the class of apps that it admits, it offers a relatively simple abstraction to developers that admits highly parallel and efficient implementations on widely available hardware. Deep conceptual work or not, that sounds significant to me.
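The equivalence is short enough to write down. The sketch below is plain C rather than CUDA, with hypothetical names (`kernel`, `launch`) and arbitrary per-stage arithmetic. The same SPMD body branches on the thread ID to pick a "stage", and the launch is literally a for() loop whose index is the thread ID. It deliberately leaves out any communication between the two stages, which is exactly where the rudimentary atomics would start to hurt.

```c
#include <assert.h>

#define NUM_THREADS 64
#define GROUP       32

/* Hypothetical SPMD kernel body: every thread runs the same function,
 * but a conditional on the thread ID selects which "stage" it acts as. */
static void kernel(int tid, const int *in, int *out)
{
    assert(tid >= 0 && tid < NUM_THREADS);
    if (tid / GROUP == 0)
        out[tid] = in[tid] * 2;      /* threads 0-31: stage A */
    else
        out[tid] = in[tid] + 100;    /* threads 32-63: stage B */
}

/* The "parallel for()" reading: the loop index is the thread ID.
 * Iterations are independent, so any execution order would do. */
static void launch(const int *in, int *out)
{
    for (int tid = 0; tid < NUM_THREADS; tid++)
        kernel(tid, in, out);
}
```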

I mentioned something to Solomon earlier that has caught in my head while listening to the many real and hypothetical task parallel toolkits and programming models being bandied about. It shows up both with commercial products like CUDA and Threading Building Blocks and academic discussions like Solomon's parallel python presentation and Sequoia. There seems to be a widespread assumption that "tasks" only communicate at entry and exit and/or that kernels operate on their entire dataset in one (nearly) contiguous pass. That is, applications are sequences of tasks/kernels and dependencies that are dispatched first to last by the system. Two tasks may be resident (actually executing on available cores) at the same time, but it is just a load-balancing / machine-packing mechanism. In fact, there is very little support for producer-consumer parallelism. This took me a while to articulate because FIFOs and co-resident stages are such a fundamental assumption in graphics pipelines, but they are apparently not the obvious/ground truth requirements for a parallel system for which I took them.
Upon consideration, this seems to reflect two different divergences from my world view: operation on in-place/persistent data and the ephemerality of tasks. The former is easily shown with an example from cloth simulation. Such simulations (or at least, many such simulations) are based on a preset point-spring system. The first kernel of each time step visits all the points, performs local force computations, and computes each point's proposed new position and velocity. In that sense, it's like a vertex or fragment shader. However, unlike transformed vertices or shaded fragments, the grid of points persists statically across every time step. Transformed vertices are queued and discarded after rasterization, but "transformed" cloth points are the input in the next pass. Additionally, they're on a preset, pre-allocated grid that is not even repacked, let alone variable length. Thus, it makes vastly more sense to assume that a task, or set of tasks, will operate in-place on the grid rather than consume it and queue output downstream. In contrast, when a kernel (or set of tasks) generates a variable number of outputs or transient intermediate results, dynamic queueing handles allocation, and scheduling the producing and consuming tasks co-residently saves a huge amount of bandwidth and storage compared to spilling all the intermediate data and recirculating it. Given the longstanding emphasis on iterative simulations and solvers and the relative inexpense of spilling to memory in past networked clusters compared to current chip multi-processors, I can see how co-resident producer-consumer parallelism has been overlooked.
The other divergence I see is that other parallel systems-- task or data-parallel-- assume tasks are ephemeral. Specifically, they assume that preemption is never necessary: with only minor/short-lived disturbances, load balancing can be achieved by waiting for tasks to finish and appropriately placing new ones. This has a major consequence: no task-local statefulness. That's an abstract mouthful that deserves an example. Consider a kernel, task, or set of collaborating tasks that want to perform an operation that spans multiple input elements: e.g., computing the median of all inputs, sorting fragments in depth order and resolving transparency, or even just caching the last N inputs or intermediate results for some purpose. There are two straightforward choices for scheduling such functions: defer launching them until the system is certain all of their input is ready or launch as soon as the first input arrives, but deschedule/preempt when more input is temporarily unavailable. Only the former is compatible with ephemeral tasks, but it requires potentially huge bandwidth and storage to spill and recirculate (much like the variable output problem above).
There's a third choice that advocates of task systems (or users who happen to be stuck with them) suggest: fake the latter by blocking the input into chunks and running a series of sub-tasks that compute a partial reduction and spill enough internal state for the next task to pick up and continue. There is no shortage of reasons to complain about this "solution", mostly the standard event/continuation warts: sub-task management and bookkeeping ends up dumped on the application writer, the internal/continuation state needs to be synchronized with some ad hoc locks, barriers, etc. I prefer a simpler objection, though: preemption of threads is not a difficult problem and is one long since solved (it is even routinely built into processor cores these days). Why inflict ugly program transformations on the programmer to fake a semantic that is so easily implemented?
I think both of these "weird" (to me) expectations reflect a history emphasizing networked clusters or parallel supercomputers that were rich (at the time) in per-node resources. And, also likely, a heritage of running offline applications. In both cases, the relative costs of spilling even megabytes of intermediate buffering to seconds (or minutes or hours) of kernel computation time are acceptable. At real-time and interactive rates, there are just no resources: GPU designers tell me they have been off-chip bandwidth limited for multiple generations and they already go to extreme lengths never to spill intermediate data from on-chip FIFOs. I suspect that at real-time rates, the ephemerality assumption for load-balancing can also become questionable. It clearly holds for shader-like tiny thread instances, but I remain skeptical about launching oodles of short-lived tasks rather than fewer longer lived tasks that dynamically produce and consume on queues. I realize the two models are duals and it is possible the compiler and programming language folks can transform between them effectively, but it seems more likely to map the latter onto actual hardware with reasonable overheads. I am also unsure what the most comprehensible task-dual is to specifying queue/buffer lengths (obviously a limit on task spawn/fan-out, but since tasks cannot just block if they try to spawn while at the limit, it seems potentially ugly).
One concluding note: one can get a lot of mileage out of dependency chains of ephemeral tasks which only communicate at entry and exit. That is, these models evolved for good reasons that still hold. For that matter, CUDA is an even more constrained model-- there is not (yet) an explicit dependency mechanism and the tasks are confined to trivial data-parallel kernels-- and it is widely gaining adoption by people who find it fits pieces of their workload (or can be shoehorned into roughly fitting and the performance plus simplicity trade-offs appeal). I have even found that GRAMPS needs an analog of the dependency/barrier mechanism for the cases where it makes sense to operate in place or a reduction needs to know when it has seen all relevant input. However, I am surprised by the apparent blind-spot to recognizing the locality advantages of producer-consumer parallelism between co-resident threads. It seems like a straightforward extension to task parallelism whereas trying to infer it automatically and reconstruct it via task-fusion and program transformation seems very difficult.
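To make the complaint concrete, here is a sketch of the chunked workaround applied to a small multi-element operation: the longest run of equal consecutive inputs. Everything in it (`RunState`, `run_subtask`, `longest_run_chunked`) is hypothetical. Each call to `run_subtask` plays the role of one short-lived task, and `RunState` is the continuation state the application has to define, initialize, and hand from one sub-task to the next by itself.

```c
#include <assert.h>
#include <stddef.h>

/* Continuation state spilled between sub-tasks (hypothetical layout). */
typedef struct {
    int    have_prev;  /* has any element been seen yet? */
    int    prev;       /* last element of the previous chunk */
    size_t run;        /* length of the run ending at prev */
    size_t best;       /* longest run seen so far */
} RunState;

/* One ephemeral sub-task: process a chunk, then exit, leaving everything
 * the next sub-task needs in *s. */
static void run_subtask(RunState *s, const int *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s->have_prev && chunk[i] == s->prev)
            s->run++;
        else
            s->run = 1;
        s->prev = chunk[i];
        s->have_prev = 1;
        if (s->run > s->best)
            s->best = s->run;
    }
}

/* The application-side bookkeeping: block the input and chain the
 * sub-tasks in order. A stateful, preemptible thread would just loop. */
static size_t longest_run_chunked(const int *data, size_t n, size_t chunk_size)
{
    RunState s = { 0, 0, 0, 0 };
    assert(chunk_size > 0);
    for (size_t off = 0; off < n; off += chunk_size) {
        size_t len = (n - off < chunk_size) ? n - off : chunk_size;
        run_subtask(&s, data + off, len);
    }
    return s.best;
}
```

Even in this toy, the sub-tasks must run strictly in order and share `RunState`, so a real task system would also need the dependencies or locks mentioned above.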

It qualifies as ironic that after all the heckling I've given CUDA about having barrier as its sole synchronization primitive, it is emerging as a fundamental addition to GRAMPS. However, it's more of an amusing ironic than an eating crow ironic. While many applications-- multi-pass graphics, Map/Reduce, cloth simulation, REYES-- all need or can profit from barrier-like functionality, it's for synchronizing between entire kernels/stages/phases of computation rather than stray intra-kernel thread groups. Additionally, queues were still the first synchronization primitive and barriers supplement them, not supplant them. Barriers are also somewhat of a syntactic convenience as it's possible to fake them with additional queues and a thread stage that watches for arrival. In fact, I expect that they'll often manifest as app-injected NO_MORE tokens through existing queues rather than solely as an entirely parallel (heh) dependency mechanism.
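A sketch of the "barrier as NO_MORE tokens" idea, under loud assumptions: the sentinel value, the flat array standing in for a queue, and the function name `drain_until_barrier` are all invented here, and real data is assumed to be non-negative so that the sentinel can't collide with it. Each upstream producer appends one NO_MORE after its last element, and the watching stage knows the barrier has tripped once it has counted one token per producer.

```c
#include <assert.h>
#include <stddef.h>

#define NO_MORE (-1)   /* hypothetical sentinel; real data assumed non-negative */

/* Consume a queue fed by num_producers upstream instances, each of which
 * appends NO_MORE after its last element. Accumulates the data seen into
 * *sum. Returns 1 once every producer has signalled (the "barrier"),
 * or 0 if the stream runs out before that happens. */
static int drain_until_barrier(const int *stream, size_t n,
                               int num_producers, long *sum)
{
    int done = 0;

    assert(sum != NULL);
    *sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (stream[i] == NO_MORE) {
            if (++done == num_producers)
                return 1;            /* all inputs seen: safe to proceed */
        } else {
            *sum += stream[i];
        }
    }
    return 0;
}
```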

7 November 2008:

Reading other people's papers about programming models has made me recognize a critical 'success trait' that they and I have done a bad job articulating: optimizability by expert developers. Everyone knows (and says) that a programming model should be easy for developers, but that does not only mean easy for novices or easy to write the first version of the program. Everyone also knows and says that a programming model should map onto an efficient hardware implementation (and, if they know Kurt, that a programming model should also influence future hardware implementations to run it even better)-- something I often casually phrase as, "the first version of the program should come close to the best version of the program." However, in practice the first version is never going to be the best version and the systems that gain popularity are those that give motivated developers the tools to optimize their programs. From a social perspective, this is natural: the transition from novice to expert comes from understanding a system well enough to manipulate and optimize usage. If there is no understanding to gain, all users are perpetual novices and there is no pleasurable reinforcement nor punitive stickiness from becoming skilled.
Any programming model brings generality (if it didn't, it would be an application, not a programming model). And, generality is directly opposed to the programming model/run-time/hardware being able to maximize optimization automatically. Instead, I would claim that any programming model that has been successful/popular has included explicit mechanisms for its power users (i.e., savvy developers) to tune, guide, and/or optimize. Even programming models that emphasize ease over performance-- e.g., many scripting languages and probably all managed code environments-- develop extensive practices and features for enhancing performance. These show up consistently as the same thing: guides for the programming model to exploit its own performance features better. Map-Reduce implementations tend to let power users control the number of map and reduce tasks and the partitioning in case the best effort black box strategy for load balancing can be improved. The many-core versions expose a knob for appropriate batch sizes so that users can tune based on the (non-input) working set sizes of their map functions. Similarly, GRAMPS lets users specify maximum queue sizes as explicit load balancing hints to the scheduler. CUDA makes explicit the size of thread group launches because it's a correctness requirement for some use cases of the shared cache, but it's also an optimization parameter for balancing number of groups versus group size in many others.
A completely separate observation: GRAMPS is not alone in leaving it a separate problem to write good kernels. All of the parallel programming models about which I'm reading are focused on structuring the skeleton of the application for performance and abdicate discussion or ownership of optimizing kernels (whether they're called kernels, shaders, mapper/reducers, tasks, etc.). This is also entirely natural: when the focus is on obtaining speedup through parallelism, obtaining speedup in the various sequential instances is an orthogonal problem. The sequential bits are also common to, and richly mined by, the existing work in the context of serial execution. Natural or not, it is nice to see it confirmed de facto (none state it outright) by other work in the area.
There's a boatload of "we implemented Map-Reduce on XXX" papers in the last few years. It's new and different to be researching something that's comprehensively discussed in the mainstream literature. Collision detection is murkier-- there are boatloads of those papers, but nothing terribly recent and algorithmic. Certainly nothing concretely about parallelization. On the other hand, I found a stealth Larrabee paper about rendering physics that includes an implication of many-core-ifying collision detection as part of cloth simulation. That prompted the hindsight-obvious realization that it makes more sense to look for (still fairly simple/synthetic) workloads that incorporate collision detection rather than looking at it in abstract isolation.
In other news, the VMware and Stanford slices of my life blurred for a week or two while I helped Micah prepare a paper on GPU virtualization for the First Workshop on I/O Virtualization. Weirdly, the Industrial Practice track disappeared and the paper is now lumped under "Network Virtualization". Equating virtualization I/O (or operating systems I/O in general) with networking and storage only is such parochial thinking. (This is more a commentary on the submissions than the workshop, since the program committee obviously can't fabricate tracks without content).
Update (4 Dec 2008): We've received confirmation that the Industrial Practice track dissolved and our submission is a regular part of the conference. Poof.

3 November 2008:

Doing background research on programming models and run-time systems for parallel architectures, I discovered that Cilk has a Multicore programming blog. That's probably near the top of the list of ways you know you've become a sold-out parody of yourself. Oh well. I suppose the "Resources for Multicoders" page, or any use of the term "Multicoders" is also a strong indication.

14 October 2008:

This time for sure. The TOG paper now has an acknowledgements section. Thankfully I realized it was missing before the paper was too far into production to adjust.

10 October 2008:

Post-TOG, post-High Holy Days, and generally as things start to calm down, I've begun drafting a GRAMPS roadmap (in parallel with chasing some obvious desirable coding changes just to get my hands dirty again). Hopefully, it'll seem as reasonable to the readers outside my head as it does to the ones inside.

6 October 2008:

Upon challenge, Kayvon has proudly claimed that much better than any wimpy side-effect free programming is his favoured scheme: side-effect only programming. Of course, this is merely after-the-fact confirmation for any discerning reader of the GRAMPS scheduling and graph analysis code.
On Friday evening, I hacked up a 'favour the front of the pipeline' scheduler option for GRAMPS. It is essentially the reverse of the default scheduler: stages are statically prioritized based on their proximity to start stages. I ran it with the CPU-like configuration and immediately ran smack into a familiar problem: head of line blocking. Pipeline bubbles would periodically develop when one core / thread's scheduler would claim some fragment work for its scoreboard, but then be deluged with enough vertex work that the fragment work languished undispatched. Eventually, the blend stage stalled because of ordering. It needed the undispatched fragments before it could handle all of the other pending work. Natural solutions are work-stealing schedulers or else schedulers that voluntarily relinquish work. In theory the scheduler could also queue less work, but ultimately that leads to frequent scheduling decisions and those are high overhead without dedicated hardware. That might be a reasonable approach if combined with large packet sizes (or dispatching multiple packets at a time), though.
The situation's a little different for the GPU-like configuration. Its scheduler is already synchronous. That amounts to queuing less work and isn't very susceptible to the head of line blocking scenario above. That said, it develops one pipeline bubble I can't currently explain with confidence. Here is the queue and core occupancy for a 1024x1024 rendering of the courtyard scene. I don't fully understand the bubble that develops about 60% of the way through in conjunction with the late spike in RO work. That's downstream of the critical weirdness though. There are occasional short bursts of vertex work surrounded by fragment and blend work such as the one where the cursor is in the image. Given the priorities, that vertex work should have been scheduled as soon as it appeared, but I can't explain why it just appeared then. Even if it were backlogged on a pending scoreboard, each thread has run both blend and fragment work and those should have been pre-empted. Order is always the whipping boy for weird things, but I haven't pieced together the full narrative yet.
It's worth noting that the lazy scheduler has exactly the footprint trade-off against the eager scheduler that one would predict. Here is the same occupancy picture for the eager scheduler. It goes through far more vertex - fragment - blend cycles and has the pipeline bubbles from exhausting all fragment work, but not being able to fill the machine with vertex-work alone. The thing is, lazy does not have significantly better thread slot occupancy and (unsurprisingly) it has a much higher queue footprint. The lazy scheduler gets 95.3% occupancy with a 5.5 meg worst case queue footprint. The eager scheduler gets 95.0% -- almost the same -- but has a 1.3 meg footprint. Clearly I need to understand the bubble in the lazy scheduler. It may also be that eager will be unfairly successful so long as scheduling is 'free'.
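As a rough illustration of "statically prioritized based on proximity to start stages", here is a minimal sketch that ranks stages by breadth-first distance from the start stages. It is not the GRAMPS scheduler: the function names, the edge-list graph representation, and the three-stage pipeline are all assumptions made for the example.

```python
from collections import deque


def front_first_priorities(edges, start_stages):
    """Map each stage to its hop count from the nearest start stage.

    A lower number means closer to the front of the pipeline.
    """
    successors = {}
    for producer, consumer in edges:
        successors.setdefault(producer, []).append(consumer)
    distance = {stage: 0 for stage in start_stages}
    frontier = deque(start_stages)
    while frontier:
        stage = frontier.popleft()
        for nxt in successors.get(stage, []):
            if nxt not in distance:  # first visit is the shortest path
                distance[nxt] = distance[stage] + 1
                frontier.append(nxt)
    return distance


def pick_stage(ready, priorities, favour_front=True):
    """Choose one stage with pending work; favour_front=False reverses the order."""
    def rank(stage):
        return priorities[stage]
    return min(ready, key=rank) if favour_front else max(ready, key=rank)


# Hypothetical pipeline using the stage names from this entry.
EDGES = [("vertex", "fragment"), ("fragment", "blend")]
PRIORITIES = front_first_priorities(EDGES, ["vertex"])
```

With these priorities, `pick_stage(["fragment", "blend"], PRIORITIES)` chooses `fragment`, while passing `favour_front=False` chooses `blend`. The sketch only covers the static ordering; it says nothing about scoreboards or ordering constraints, which is where the head of line blocking comes from.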
On that note, it's probably time to put together some synthetic pipelines of various topologies, but with trivial kernels.

3 October 2008:

A crowd of NVIDIA folks came to visit to discuss and poke holes in GRAMPS. An interesting point, obvious in hindsight, is that if (a) the person designing your application graph has painstakingly and lovingly sized the queues just right and (b) each queue comes from a separate pool so there's no incentive to make one shorter dynamically to add capacity to another then lazy scheduling is perfect (wait until queue fills to preempt). It's really a generalization of my discovery / claim that you want to schedule graph cycles by prioritizing the 'top' of the cycle and carefully sizing the queues inside it. Otherwise the discussion was interesting and diverse, but didn't offer up any stark insights or future work / gaps. I am reminded of my curiosity of just how much is fakable in CUDA in the present or near-future term, but this wasn't really the venue for me to dive probingly into that.
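To make the "wait until queue fills to preempt" idea concrete, here is a toy model of a single worker shared by one producer stage and one consumer stage. It is a sketch under heavy assumptions (one item per step, free stage switches, no ordering constraints) and not the GRAMPS scheduler; `simulate` and its policy names are hypothetical.

```python
def simulate(policy, total_items, queue_capacity):
    """Run one worker over a producer -> queue -> consumer pipeline.

    Each step moves exactly one item. Returns (peak_queue_depth, stage_switches).
    """
    produced = consumed = queue = peak = switches = 0
    current = "producer"
    while consumed < total_items:
        can_produce = produced < total_items and queue < queue_capacity
        can_consume = queue > 0
        if policy == "lazy":
            # Stay on the current stage until it cannot make progress.
            if current == "producer":
                want = "producer" if can_produce else "consumer"
            else:
                want = "consumer" if can_consume else "producer"
        else:
            # "eager": drain downstream work as soon as any exists.
            want = "consumer" if can_consume else "producer"
        if want != current:
            switches += 1
            current = want
        if current == "producer":
            produced += 1
            queue += 1
            peak = max(peak, queue)
        else:
            consumed += 1
            queue -= 1
    return peak, switches
```

For 8 items and a queue capacity of 4, the lazy policy fills the queue (peak depth 4) and switches stages 3 times, while the eager policy never holds more than 1 item but switches 15 times. In this toy, the queue size directly sets both the footprint and the switch count for the lazy policy, which is one way to read why carefully sized queues matter so much to it.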
After a heart to heart with the micro core scheduler, GPU-like GRAMPS now plays well with grampsviz too. Here's one fat core and four micro cores running D3D for a 1024x1024 courtyard. For comparison, here's CPU-like. My superficial impression is that the time spent in FB shows up a lot more in GPU-like than CPU-like. Given GPU-like's 100 cycle memory latency, that's not so peculiar.

2 October 2008:

It's GRAMPS-fest. After multiple re-inventions, I've sent the final version to TOG for publication. In addition, Kayvon and I gave two presentations over the summer: one to some Intel people and one to some folks from AMD.

27 May 2008:

AMD came to visit as part of the PPL bootstrapping process and it proved a convenient excuse to put together a GRAMPS overview presentation.

18 April 2008:

An interesting question: should GRAMPS include an explicit fork/join concept? Specifically, how helpful is join? It seems like fork/join is syntactic sugar given push and that it's basically a question of message-passing, continuation driven programming versus the illusion of synchronicity. Join does not seem to add an inherently new expressiveness, but it may be that incorporating it into GRAMPS allows better implementations than layering it. I will admit to being coloured by systems where fork/join style operations are (i) within a stage (texturing is the easiest example) and (ii) are either literally done synchronously with magic hardware that supports vastly more threads than execution slots or are unrolled and software threaded by the compiler / toolchain before they get handed to the run-time system and the run-time system only has the services (and concept / tuning) of the transformed version. In those cases, fork/join function calls feel to me closer to 'how do I write good programs for my stages' than 'how do I structure my application' so, while they could be added to GRAMPS, they are independent from and not directly motivated by its central vision.
And my readings on JavaScript, AJAX, and Ruby on Rails are interesting in their own way. It's such a string-and-bubble-gum environment-- a lot like shell and perl scripting in space, err online. I now find myself disentangling and digging through the source to web pages out of curiosity, but other than finally installing Flashblock (which was really a bit of a non-sequitur, but has been on my list for a while), I haven't tried to take any implementation steps of my own. It's really tempting to try and write a terminal text based frontend for firefox or add a JavaScript interpreter to lynx (though I'm worried it effectively does immediate mode rendering and doesn't actually build out a full DOM, which would make incorporating JavaScript a real nuisance).
Interesting. Apparently, at least some versions of elinks can use SpiderMonkey to do limited JavaScript. It's too bad elinks is so much clunkier looking than lynx (in fairness, it tries to replicate a full fledged renderer when that's actually counter to my interests).

3 April 2008:

In other news, I've decided to enhance my kneejerk disdain of Web 2.0 to an informed and snobbish disdain of Web 2.0 (and really, of any web pages not best viewed with my browser of choice) so I'm cadging web development mentoring from the HCI and Vis people all over the lab.
After pondering, despite an intense effort to update the system and paper for Graphics Hardware, GRAMPS is going to TOG. Time will tell whether that was a better choice for our goals. That other paper to which I contributed was accepted with the astonishing granting of two pages' additional length (in exchange for providing more details about the renderer). And due to consternation over perceived connotations of author lists, I'm losing my entertaining anchor position as the last author.

28 February 2008:

I used my gcafe slot as an excuse and motivation to spend some time reviewing and updating my thinking about REYES from the Autumn. I put together these slides relating REYES to Direct3D / GL rendering and discussing strategies for running REYES and REYES-like algorithms on GPUs and future GPU-based designs.
Apparently, it worked as Pat declared at the end that when I came in as a systems person he had wondered if I'd ever be a graphics person, but that I'd just demonstrated I was legitimately a graphics person. I wonder if this means I should stop telling people who ask about my Ph.D. program that I'm pretending to be a graphics person.

3 January 2008:

Apparently the radical new visual paradigm of PowerPoint 2007 can only stymie Kayvon for a few minutes. After grappling with his bafflement, he was able to whip out this nice image of the GRAMPS graphs for how we actually expect real versions of our applications to look.
Oh, and printing from Vista sucks.

8 November 2007:

When is a door not a door? When it's ajar. Apparently the same principle applies to libelf. RedHat ships a libelf that's totally different from the libelf available from the FSF. It's versioned differently and the files don't even look like forks of each other (or else they're old forks). Conveniently, Kayvon used exactly one function that's only defined in the FSF version.

6 November 2007:

There are two types of research projects. In one, someone has a task / problem / or activity they think is done totally wrong today and 'know' how it really should work. In the other, someone has something they think they can build, or at least describe, and that's their driving motivation. It can be very frustrating to try and understand the latter type of projects because all of their examples end up feeling artificial / forced / less appealing than current alternatives. It can be even more frustrating when their perspective is, "I don't know. What do you think would be a good application for it?"
And now we have syscalls.

1 November 2007:

Victory. Solomon has produced (with only mild heckling) a binutils and gcc that emits XPU vector instructions mixed with MIPS64 assembly. Now if only that slacker Kayvon had the interpreter ready. And we had instructions other than load, store, and add...

22 October 2007:

The GPU kd-tree paper I reviewed a while back has emerged (should I not out myself as a reviewer? I figure I'm a pretty predictable candidate, I already commented on it at the time, I liked the paper, and I think my comments were entirely fair so I have no reservations being known). I haven't yet reread it in detail, but their final paragraph is a little, well, amusing.
```
   The G80 with its 128 scalar arithmetic units running at 1.3 GHz
   should deliver over 160 GFlops, meaning that _tracing one ray costs
   about 10,000 cycles_ [!! emphasis added].  We suspect the main
   bottleneck to be the large number of registers in the compiled
   code, which limits the occupancy of the GPU to less than 33%.
   Unforunately, although the program requires much less registers,
   the CUDA compiler is not yet mature enough and cannot aid in
   reducing their count.  An option would be to rewrite the whole
   CUDA code in PTX intermediate assembly.
```
The unfortunate CUDA compiler is an old familiar whipping boy of mine. I wonder if they've figured out yet that the intermediate assembly doesn't help with register allocation (it's entirely done by the backend compilation to the final form) or if NVIDIA has relented on that stance. I resorted to using the shared cache as excess register file and manually spilling my ray data there while computing intersections. That saved a huge number of registers without limiting parallelism (since I was still slightly register bound, not shared space bound). And of course, I have to wonder what happens when they discover that no amount of parallelism can hide the divergent execution penalty within a SIMD group, warp, or whatever they're called these days.
I realize I don't think our "Ray Tracing on Cell" poster is available anywhere. The abstract should be in the RT06 docs, but not the poster, and the paper repository does not include posters. There's not really much content to it, but I'd still rather have the poster available. So, here.

20 October 2007:

A quick observation about z-batching (from the Gelato guys' GPU hidden surface elimination paper), since I had to stare at the paragraph for a while before I realized what they meant-- the goal of z-batching is actually to make the batches as small as possible. I found this counter-intuitive since generally when one names something "batching", the goal is aggregation. However, in this case the performance pressure is to drive down batch size: performance is O((N/B) * B^2) for disjoint batches of size B. Thus, as B approaches N, this degrades back to O(N^2) whereas as B approaches 1 this goes to O(N). Note that disjoint requirement, however. With overlap, a batch has to be padded out to include all the grids after it that overlap its z-range. As such, overlapping grids impose a certain minimum effective batch size. It seems like something even slightly clever during grid sorting could nicely guide either a fixed or adaptive batch size.
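The cost expression above is easy to check numerically. This is a minimal sketch of the disjoint-batch case only, assuming each batch costs the square of its size; `batched_cost` is a name made up for the example and the padding needed for overlapping z-ranges is not modelled.

```python
def batched_cost(n, b):
    """Total pairwise work for n grids split into disjoint batches of size b.

    Assumes each batch costs (batch size)^2; a final partial batch is allowed.
    """
    full_batches, remainder = divmod(n, b)
    return full_batches * b * b + remainder * remainder


N = 1024
for B in (1, 32, N):
    print(B, batched_cost(N, B))
```

For N = 1024 this prints 1024 for B = 1, 32768 for B = 32, and 1048576 for B = N: the cost is N * B when B divides N, so it is linear at one extreme and quadratic at the other.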
On a slightly related note, I'd like to register a formal complaint about the term "indirect framebuffer". The entity so described bears no semblance to a framebuffer and the ridiculously overblown name impairs credibility (and to some extent comprehensibility). It's more of an index or indirection one could use to formulate an indirect framebuffer rather than any sort of framebuffer itself.

18 October 2007:

Man. I spend no time in my office anymore. Lots and lots of talking, but no writing or coding or anything. So, rather than record any thoughts, I'll just make a note that we debuted GRAMPS on Tuesday. Pat's already expressing anxiety about the name. A casual search only turned up an obviously unrelated genealogy project and a SIGGRAPH paper from 1981. Kayvon and I agreed that the statute of limitations on that has already expired.

11 October 2007:

Here's how I break down REYES' execution:
- Bounding, Culling, and Splitting
- Dicing
- Displacement Shading
- Surface, Lighting, etc. Shading
- Rasterization (really!)
- Blending
The step I call "rasterization" has a variety of other names in the REYES and RenderMan literature, but it is entirely rasterization! A collection of surface patches in object space are transformed to image space and sampled according to pixel (and subpixel) locations. OpenGL and Direct3D call the results fragments while RenderMan calls the results visible points, but they are the same results of the same process. Credit to Kurt for hammering home that not only is this "just like rasterization" but it simply is rasterization. Note that the REYES rasterizer is differently complex than the GL/D3D rasterizer. The REYES rasterizer is simpler because all it ever gets are shading grids (effectively quad strips) and they're very regular. The REYES rasterizer is more complicated because it supports interpolated primitives. I.e. grids that have an associated parameterized location and whose parameter varies among different samples. This is REYES' "stochastic sampling", but in practice it seems to me specifically interpolation rather than a more general scheme. It's also absolutely fundamental to REYES as opposed to a rare corner case that can be marginalized or ignored.
Lpics strikes me as weird because they took this framework and inserted themselves logically in the middle. They split out just the lighting portion of shading and converted it to run in image space instead of object space. Then they turned off transparency so that visible points corresponded one to one with pixel locations. Then they turned off antialiasing so that REYES' surface shading corresponded one to one with a GPU fragment's image space shading. Within this framework, they took their modified subset of shading, transformed all the inputs to be in pixel space instead of object space, and ran them at the end of the GPU. I agree that it's clever and effective, but it does seem a little contorted. I suppose GPUs of that era basically mandated contorted in order to accomplish anything besides GL/D3D. The unfortunate and unexplained thing is the decision to run the GPU shaders in pixel space instead of object space. That inherently denies them the visibility supersampling that's so central to REYES. It doesn't seem like it would have been that hard to literally interpose in the middle instead of just logically. On DirectX 10 era GPUs, I'd actually take things a step further and cut the whole bottom of the REYES pipeline over to the GPU...
Frustratingly, I can't find anywhere that the NVIDIA Gelato documentation describes their rendering strategy. They make ridiculous (well, marketing driven drivel) statements about designing from the ground up for GPUs, but I couldn't even find a whitepaper level description of what they mean. Given their market, their support for RIB inputs, and various other constraints and effects, I strongly suspect they implement a customized, but REYES-derived, pipeline. Their design is adjusted to be GPU-aware, but still heavily balanced to compete with prman in functionality. I'd be intrigued by their opinion about what functionality would have to be jettisoned to compete with GL instead (and what advantages and disadvantages would remain if the result weren't simply equivalent).
Now that I'm actually learning about REYES and RenderMan, I went back to review the Lpics paper. I'm somewhere between bemused and confused by their choice of where and how to chop prman and insert a GPU. From my perspective, it's not at all the first obvious place, but the paper doesn't discuss alternatives or justify its choice. Obviously, Lpics was vastly useful and got a good speedup, so it's not as if their choice is flawed. It's just surprising and I wish they explained the surprise. After talking about it with Kayvon, though, I can sort of see how it might have seemed like the obvious place to them, at the time. It does fully constrain the GPU portion of the workload to gather-free "render a big quad and stream" style GPGPU programming. It just also gives up (a) a lot more shader execution that seems like it could run just fine on a GPU and (b) the entire visibility anti-aliasing that's fundamental to the REYES mindset. Also, I desperately wish they'd given some numbers or guesses at the performance they would have without simplifying the shaders. They report a 20000:1 speedup with an Lpics render of 0.1 second per frame. I'd be stunned if the simplification had any material impact on that ratio. Even if the simplification provided a 10x boost, unsimplified shading would only take 1 second per frame and still be 2000x faster than prman. The paper says that without simplification their shaders cannot run at interactive rates, but it also says the jump from NV30 to NV40 produced a 10x performance gain. I wonder if those shaders were still non-interactive (and how much) on NV40 and how they run on a current GPU. NV30 also had much harsher instruction length limits and even instruction set limits than exist today.
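The back-of-the-envelope argument above can be written out explicitly. This is only the arithmetic from the paragraph, done in integer milliseconds to avoid rounding; the 10x simplification gain is the hypothetical figure from the text, not a measured one.

```python
LPICS_MS_PER_FRAME = 100          # the 0.1 second per frame Lpics render time
REPORTED_SPEEDUP = 20000          # the 20000:1 speedup over prman
ASSUMED_SIMPLIFICATION_GAIN = 10  # hypothetical: suppose simplification bought 10x

# Implied prman render time per frame, in milliseconds.
prman_ms = LPICS_MS_PER_FRAME * REPORTED_SPEEDUP

# Render time if the shaders had not been simplified, under the 10x assumption.
unsimplified_ms = LPICS_MS_PER_FRAME * ASSUMED_SIMPLIFICATION_GAIN

# Speedup that would remain without simplification.
remaining_speedup = prman_ms // unsimplified_ms
```

The implied prman time is 2,000,000 ms (2000 seconds per frame), the unsimplified render would take 1000 ms, and the remaining speedup is 2000x.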
It sounds like Lightspeed chops things up more like I would envision. However, when Jonathan presented it here last school year, I didn't really have the context to understand enough that my retention is any good so I only have offhand remarks from Kayvon and Jonathan to back that up.
Vladlen's talk about content creation through design spaces was very interesting, although it suffered from promising vastly more than it delivered. Needlessly, I might add. What it does is cool enough that overclaiming was just distracting. And, like all texture synthesis, image editing, and other such applications, it's an excellent framework for a whole bunch of 'adult' applications. I'm still waiting for the series of papers about applications of computer graphics techniques to pornography. You can tell from the artwork at SIGGRAPH, that it's hardly far from many people's minds...

5 October 2007:

I'm not done talking about REYES, but am temporarily short on time. Having established the background, I'd like to articulate the few more intrinsic distinctions and then the more stylistic / conventional distinctions (akin to programming in different Turing complete languages giving rise to very different things seeming "natural").
I talked with (more talked at) another of the new Ph.D. students. It was good to be challenged to present my take on things and try and verbalize my current thinking beyond the tactical issues of settling a GL/D3D mapping for the little core portion of our system. I should really take some time and write it out concretely so that I focus my development efforts appropriately. As always, there's just too many possibilities and ideas to work efficiently without conscious bigger picture steering.

4 October 2007:

Another day of reading about REYES and PRMan and trying to work out what is algorithmic, what is implementation, and what is just like a GL/D3D system. I went to Kurt to ask what he'd seen as the distinctions when RenderMan and GL were both young and in the context of today's largely programmable GL. Interestingly, at a high level / from an API perspective, he felt there was not much difference for generating movie / picture type images and that GL offered a lot more control and specification over details that were critical to people using graphics as a tool for engineering or other design. His argument is that REYES is basically classical GL: transformed and shaded vertices that are flat shaded and then rasterized into pixels. The dicing / tessellation stages correspond to the same spline evaluators that have been part of GL since the beginning. Kurt identified the hard-coded depth-order hidden surface elimination scheme as a potential weakness for applications wishing to explore alternatives (contrasted with GL's guarantee of in-order rendering with optional additional tests such as depth-buffering).
It's an interesting perspective that matched a lot of my analysis while still having some divergence. I definitely believe and agree that the post-dicing REYES model is roughly equivalent to GL vertex shading and then rasterization without (or with trivial) fragment shading. While I agree that technically the GL evaluators resemble the REYES dicing, I think the fact that they've essentially remained vestigial pieces of GL ignored by the hardware and applications makes the congruence weak. The REYES rasterizer is also a different creature from the GL rasterizer. The REYES "rasterizer" knows that its primitives are only quads whose area is only on the order of a single pixel. However, it also has to support "stochastic sampling", which is really just a fancy way of saying it has to additionally sample in time or lens position for motion blur or depth of field. Arguably the most major difference, though, is that the REYES rasterizer delivers order independent transparency. It cannot commit any samples to a pixel until all samples are available and sorted in depth order (or at least all non-opaque samples). The driving motivation for the bucketing scheme in REYES / PRMan was to limit the duration over which samples for a pixel needed to be kept around uncommitted.
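The "collect everything, sort by depth, then commit" requirement can be sketched in a few lines. This is a generic front-to-back "over" composite, not PRMan's implementation; it assumes smaller depth means nearer to the camera and uses a single colour channel for brevity, and `composite_pixel` is a hypothetical name.

```python
def composite_pixel(samples):
    """Blend (depth, colour, alpha) samples for one pixel, nearest first.

    The samples are sorted here, so the result does not depend on the order
    in which they were submitted. Returns (colour, coverage).
    """
    colour, coverage = 0.0, 0.0
    for _depth, sample_colour, alpha in sorted(samples):
        visible = (1.0 - coverage) * alpha  # share not hidden by nearer samples
        colour += visible * sample_colour
        coverage += visible
    return colour, coverage


# Two half-transparent samples in front of an opaque one, submitted out of order.
samples = [(2.0, 1.0, 0.5), (1.0, 0.0, 0.5), (3.0, 1.0, 1.0)]
result = composite_pixel(samples)
```

Here `result` is `(0.5, 1.0)` no matter how `samples` is shuffled. The catch is the one described above: nothing can be committed until every sample for the pixel has arrived, so all of them have to be held somewhere in the meantime.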
Anyhow, after the Gates kick-off party and a ray tracing interlude with Solomon, Jonathan showed up. He's coyly extracting free plane tickets in exchange for enduring a job interview or two (secretly he thrives on the wheeling and dealing). He wanted to hear what we're up to and Kayvon's running off to Seattle for the weekend and Mike's involved in delicate political wrangling over trade press articles on GPU programming (sucker). Since it's been the topic of the week, I ran through assemble threads, shader threads, and the whole fan-in / fan-out / vertex caching / data sharing situation quickly and then sprung "GPU REYES" on him, too.
We agreed that the two interesting questions are: is there a picture REYES can make that GL cannot (and vice versa, but restricting to photo-like pictures) and in areas where the two are technically equivalent, what idioms / abilities are more naturally associated with REYES vs. GL? At the heart of both is an attempt to understand why, from a technical perspective, people want realtime REYES (because some definitely do), and what it is that they actually want. Clearly there's the fantasy of wanting PRMan renders at GPU rates, but just as clearly that's fantasy. Something we think is open and not fantasy is what implementation of REYES / subset of PRMan is practical for realtime / multicore / "GPU-like" implementation if we successfully produce an architecture that generalizes GPUs enough to accommodate other alternatives.
And a bit of an interlude wherein Solomon and I continued our dispute about what matters in interactive ray tracing research, why I (and others in the lab) have come to believe that shading coherence is such a key attribute, and what the opportunities and burdens of proof are. It's nice to have someone around again who's into ray tracing and has a more rooted background than my self-taught one. It'll take more back and forth before we've shared enough background to really argue over what's important, but it should be fun.

2 October 2007:

I've started trying to understand what I'll call, for lack of a better term, GPU REYES. We've been talking about alternate rendering methods for our system. I have a good understanding of the details of ray tracing and Kayvon tackled GPU style rasterization over the summer. The third major model we discuss a lot is REYES... [ And I've got to run now, sadly ]
More silence. Or rather, so many conversations in other venues that they've satisfied the urge (and consumed the time) that would have led to recording those thoughts here. Kayvon's been nice and productive this summer (like always) while I've talked a lot, but contributed little (also like always).
At least I have a reading committee now. If I'd realized I was going to do it without a specific extended thesis proposal, I'd have submitted the form back in February. Ah well, at least my thinking is more organized now.

Older notes from '05-'06 and '06-'07.