Over the past half-decade, the world has witnessed an explosion of digital photography. Ever-increasing sensor resolutions continue to drive the functionality and image quality of digital cameras past their film-based predecessors in every market they have served. The commoditization of CMOS image sensors has promised to bring Moore's Law to image sensor resolution. Advances in the development of Digital Pixel Sensors (DPS), such as the high dynamic range and high frame-rate sensors of the Stanford Image Sensor Laboratory [1,2], promise further explosions of data in a new era of digital sensor technology, no longer limited by Analog-to-Digital Conversion (ADC). However, as sensor resolution grows with CMOS process density, and as new, all-digital sensors enable an entire new class of data to be captured, sensor data capture rates promise to rapidly surpass interconnect bandwidth much as CMOS logic systems have outstripped the memory and communication systems which support them.

          While DPS systems efficiently digitize images at the pixel level, full camera systems still require a suite of separate image processing chips to perform demosaicing, tone-mapping, and compression to generate a final output image. The processing capacity of this image processing pipeline, like the capture capabilities of the sensor, will grow rapidly in accordance with Moore's Law. Thus, the greatly increased data capture capabilities of the modern DPS threaten to make the bandwidth of communication between chips in this system the sole limitation of camera performance, where ADC and image processing were the key limitations of the past. As is well known from the world of conventional computer systems, interconnect performance scales near memory performance, around the rate of industrial growth – many times slower than the growth patterns described by Moore's Law. In such a world, the capabilities of camera systems would be limited to grow according to the patterns of conventional industry, taking little to no advantage of the rapidly scaling technologies of DPS.

          We look, then, to a new capability of DPS to alleviate chip-to-chip bandwidth limitations through a more efficient use of the interconnects. CMOS Digital Pixel Sensors act and are fabricated as SRAMs. As such, they can be built integrated with conventional CMOS logic circuits. Indeed, the DPS design, itself, relies on a substantial degree of CMOS logic in each sensor element to implement ADC per-pixel. The first DPS constructed employed an 8-bit comparator and an 8-bit SRAM per-pixel, for a total of 27 transistors per-pixel, compared to 3 in a conventional analog CMOS sensor [1]. As CMOS technology scales, the logic capacity of the DPS chip will grow accordingly. Future systems will be able to employ this on-chip logic to minimize chip-to-chip communications by efficiently sampling and selectively transferring the sensor's ever-expanding mass of captured data.

          Inspired by the possibilities for random pixel read-out in DPS, I set out to investigate potential additions to DPS architectures to maximize data efficiency. I began with the goal of studying possible DPS architectures through a sensor simulator which I would construct in a ray-tracer, LRT. This goal, however, proved excessively challenging to pursue while maintaining sufficient focus on the issues of digital photography. I chose, then, to explore sensor designs directly through ad-hoc simulations, rather than through an excessively general sensor simulator.

          In this paper, I present three applications of DPS integrated logic to efficient off-chip interconnect utilization. These applications address scaling still image resolution, video resolution, and dynamic range. For each I describe a logical foundation, detail the complexities of potential hardware implementations, and present simulated results. I show the successes of each, and attempt to illustrate where they might fall short in real-world applications.

          I first set out to investigate whether an adaptive sampling methodology could be applied to the efficient sampling of high resolution still images. Many traditional computer graphics algorithms exploit knowledge of signal reconstruction to sample regions of fine detail more densely than regions of low detail. Such a methodology of adapting sampling frequency to local detail seems ideally suited to efficient data use in high resolution digital image capture.

          I first conceived of a DPS design which achieves adaptive sampling through local comparisons. The pixel comparison network consists of a conventional pixel array extended to allow pixels to compare themselves to their local neighborhoods. In this design, groups of pixels collaborate, comparing themselves to one another, to determine which pixels need to be read out for maximum image integrity. A simple implementation would allow only small, regional comparisons, while a more advanced implementation might employ a hierarchical network of comparisons to characterize the sampling of the image at multiple frequencies.

          Such a network seems, at first, simple to implement. Its core operation is the comparison, logic already implemented per-pixel in any DPS. It is built on pixel-level operations which can easily be embedded in-pixel with contemporary process technology. However, the construction of an effective pixel comparison network, requiring multiple levels of detail, is severely hindered by the myriad lines of communication necessary for in-pixel multiple level-of-detail comparisons. Further, the random pixel order of the adaptively sampled output stream is potentially difficult to organize efficiently without requiring excessive address information per-pixel, potentially killing all data efficiency gained through adaptive sampling.

          The fundamental flaw of the comparison network is its approach to logic integration. While per-pixel logic is ideal for scalable systems, the design betrays my misconception that CMOS integrated logic is best implemented physically in-pixel. With fast pixel-level ADC, DPS are digitally addressable at fast SRAM speeds. Because the DPS can perform ADC and temporary storage in-pixel, any additional logic can be constructed and interconnected in a pixel-parallel fashion, but reside outside the active pixel sensor area on-die, acting on the captured image as on memory.

          Conventional graphics suggests an alternative implementation. In this approach, a Gaussian pyramid is employed to represent multiple levels of detail, and data from this pyramid is sampled from low to high resolution as necessary for each pixel. First, the image is processed into a multi-resolution pyramid. From this pyramid, samples are scanned out at successively increasing levels of detail. Higher resolutions are selectively scanned out wherever lower resolutions are insufficiently similar to their higher resolution counterparts. The image can then be reconstructed by interpolating and combining the samples from progressive levels of detail in the multi-resolution pyramid.
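          The coarse-to-fine scan-out described above can be sketched in a few lines of Python. This is a simplified illustration rather than the hardware algorithm: a 2x2 box average stands in for the Gaussian reduction, the image is assumed to be a square power-of-two grayscale grid, each coarse sample is compared only against the next finer level, and all names (`reduce_level`, `adaptive_samples`, `threshold`) are illustrative.

```python
def reduce_level(img):
    """Halve resolution by averaging each 2x2 block (pyramid reduction)."""
    n = len(img) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(n)] for y in range(n)]

def build_pyramid(img):
    """Full-resolution image reduced down to a single-pixel root."""
    levels = [img]
    while len(levels[-1]) > 1:
        levels.append(reduce_level(levels[-1]))
    return levels[::-1]            # root (lowest detail) first

def adaptive_samples(img, threshold):
    """Scan out samples coarse-to-fine; refine a quadrant only when the
    coarse value is insufficiently similar to the finer data beneath it."""
    pyramid = build_pyramid(img)
    out = []                       # (level, y, x, value) tuples
    def visit(level, y, x):
        value = pyramid[level][y][x]
        if level == len(pyramid) - 1:          # highest resolution: emit
            out.append((level, y, x, value))
            return
        children = [(2*y + dy, 2*x + dx) for dy in (0, 1) for dx in (0, 1)]
        if all(abs(pyramid[level+1][cy][cx] - value) <= threshold
               for cy, cx in children):
            out.append((level, y, x, value))   # coarse value suffices
        else:
            for cy, cx in children:
                visit(level + 1, cy, cx)       # refine this quadrant
    visit(0, 0, 0)
    return out
```

A flat region collapses to a single coarse sample, while a region containing an edge is refined down to full resolution, so the sample count tracks local detail.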

          The multi-resolution data structure can be efficiently represented as a quadtree rooted at the lowest level of detail [fig. 1]. In this structure, leaf nodes represent the highest level of detail needed to reasonably represent a given sample or group of samples. Nodes in the tree can be represented with just 4 bits of non-pixel data to convey which children exist beneath the given node. The data can then be easily read out in a fixed order and reconstructed from just these 4 bits of additional data per-sample.

          The Gaussian pyramid adaptive sampler lends itself to a straightforward hardware implementation. The algorithm requires only an adder and a comparator for determining whether the difference between a sample and its approximation is within an acceptable threshold for the sample to be ignored, as well as a separate unit to compute the multi-resolution pyramid. The comparator and adder combined require fewer than 50 transistors, while the multi-resolution pyramid can be constructed using a conventional SIMD processor array. The SIMD processor array need only perform up to one 4-to-1 pixel reduction, each consisting of three adds and one multiply, for every three sensor pixels. For a hypothetical 4 megapixel sensor, this corresponds to 4 million adds and 1 million multiplies, easily accomplished in a short period of time with a relatively small fixed-point SIMD array of just a few hundred thousand logic transistors. This can be implemented easily in 180nm bulk silicon with relatively insignificant design effort.

          Adaptive sampling of output data represents a form of simple on-chip image compression [fig. 2]. As such, we can analyze its effectiveness as a lossy image codec. Using just this simple method, I was able to achieve visually acceptable results at a little over 50% of the full data rate. As desired, the adaptively sampled images maintain high detail while requiring proportionally less data for their representation. Difference images show that detail is maintained in regions of fine detail, while much of the difference appears as noise. Detailed regions, such as the archways and windows in the atrium scene, are maintained, while noisy high frequencies within a small threshold are discarded.

          While these results are encouraging, and do allow for as much as a 50% improvement in bandwidth utilization off-chip from the image sensor, they suggest a more dramatic near-future evolution: the movement to formal on-chip image compression. With hardware JPEG codecs occupying ever-smaller die real estate, final image compression on-chip becomes an increasing possibility. Before such measures as spatially adaptive sampling are necessary for still image bandwidth reduction, JPEG may already be available on-chip.

          For on-chip JPEG to be possible in color sensors, demosaicing must be performed on-chip. However, as sensor resolution outstrips image resolution, this becomes increasingly straightforward. Where patented black magic was once required to demosaic color images effectively, increasingly optically blurry images may actually enable a return to simple interpolation for demosaicing.

          While increasing resolution presents potential problems for low-cost, high resolution still camera systems, high resolution sensors present an explosion of data in video applications. A 5 megapixel sensor captures 15 MB frames at 8 bits per channel. While this may be manageable at still camera capture rates, it represents a massive data stream of nearly 1 GB/s at 60 Hz. Even still cameras gain substantial flexibility through continuous capture, broadening the application of high resolution, high frame-rate capture beyond video alone. Thus we are led to consider adaptive sampling in the temporal domain.
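          The data rate above follows from simple arithmetic, sketched here with decimal units (1 MB = 10^6 bytes) assumed:

```python
# Back-of-the-envelope data rate for a 5 megapixel, 3 x 8-bit sensor.
MEGA = 10**6
pixels = 5 * MEGA                         # 5 megapixel sensor
bytes_per_frame = pixels * 3              # three 8-bit channels per pixel
mb_per_frame = bytes_per_frame / MEGA     # 15 MB per frame
bytes_per_second = bytes_per_frame * 60   # continuous read-out at 60 Hz
gb_per_second = bytes_per_second / 10**9  # 0.9 GB/s, i.e. nearly 1 GB/s
```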

          Prior graphics research, such as the Frameless Rendering methodology proposed by Bishop, et al. [3], has suggested that the progressive update of images over time can maintain acceptable image quality. We look, then, to mirror this frameless presentation of time-varying data in video applications.

          Such an approach can be implemented effectively as a selective read-out system. Pixels are chosen for update based on their degree of change, and all pixels within a fixed budget are updated for each captured frame. For every frame, pixels are compared to their last output value. Based on their difference from their last known representation, they are enqueued for update. A client sits on the queue and dequeues as many pixels for update as it is able to in the span of one frame's capture.

          This algorithm can be implemented with an adder and a priority queue. The adder calculates the difference between each pixel and its last output value at every frame. Each pixel is then priority-enqueued according to this difference calculated by the adder. The logic required for the per-pixel adder is negligible, but the priority queue is potentially challenging to implement on-chip. A 5 megapixel sensor at 60 Hz generates 300 million pixels/s which must be sorted by the priority queue.
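          A software sketch of this selective read-out loop follows, using a binary heap in place of the hardware priority queue. The flat pixel list and the `budget` and `threshold` parameters are illustrative simplifications of the 2-D sensor array.

```python
import heapq

def select_updates(frame, last_output, budget, threshold=0):
    """Return indices of up to `budget` most-changed pixels; changes
    within `threshold` are discarded before they are enqueued."""
    heap = []
    for i, (new, old) in enumerate(zip(frame, last_output)):
        diff = abs(new - old)                 # per-pixel adder
        if diff > threshold:                  # comparator-based discard
            heapq.heappush(heap, (-diff, i))  # max-priority by difference
    chosen = []
    for _ in range(min(budget, len(heap))):
        _, i = heapq.heappop(heap)
        chosen.append(i)
    return chosen
```

On each frame the client would copy `frame[i]` into `last_output[i]` for every chosen index, so pixels that are never transmitted simply retain their last known value downstream.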

          In scenes with varying degrees of change, the queue load can be significantly alleviated using a simple threshold implemented with a single comparator. All pixel changes below the threshold are discarded to minimize the queue size. The change in image quality should be negligible, as the threshold should eliminate only those pixels which would not have been dequeued within the given update budget. Still, while a simple threshold can greatly reduce the average size of the queue, it cannot guarantee a maximum queue size.

          To achieve a consistently maintainable level of performance, the image can be tiled into regions which are served by local queues. The local queues have a guaranteed minimum performance corresponding to a sort of all pixels in their tile, rather than all pixels in the image. This tiled design offers the further advantage of scalability. Rather than relying on a monolithic queue to sort all pixels in the image, it employs smaller, fixed-size elements which can be replicated to serve more tiles of similar size in higher resolution image sensors.

          In scenes with small degrees of change, this approach is able to achieve 3:1 compression of data with acceptable loss in quality [fig. 3]. However, for scenes with more moderate frame-to-frame changes, it is unable to maintain high image quality at moderate data rates. Maximum data loss due to the selective update process remains low, but the approach fails because it produces particularly unpleasant artifacts, such as posterized regions with defined, pixelated edges. A static enqueue threshold further exaggerates these artifacts by causing unpleasant streaking in slow-moving, low-contrast regions.

          While this frameless video approach to temporally adaptive sampling consistently achieves at least some savings in all scenes, it cannot consistently deliver savings large enough, while maintaining strong image quality, to warrant the complexity of its implementation. The architecture of the approach is straightforward, but the scale of the sorting problem required to prioritize pixel updates makes it inefficient to implement in hardware given the modest bandwidth gains achieved in the worst case.

          What the frameless video concept does suggest, however, is the potential utility of in-sensor video compression. While the approach might be useful if its hardware requirements were lower, the fact that even an implementation of this complexity is feasible in near-future CMOS processes suggests that in-sensor MPEG compression at high resolutions may soon be practical.

          Further, the perceptual quality of the images produced by the frameless video method degrades rapidly as motion increases and pixel budgets decrease. However, the method's relatively constant mathematical error, bounded by the pixel update budget, coupled with its transmission of data only in regions of change, may lend itself to vision systems where perceptual quality is unimportant and processing power must be applied where it is needed most: where the image is changing.

          The ADC performance of DPS systems is creating an explosion in the frame-rate of digital photographic systems. Early DPS have been designed specifically for their high temporal sampling rate. The Stanford Image Sensor Laboratory has employed this high sampling rate to construct unique sensors with extremely high dynamic range and video frame-rates [1]. Others have suggested a range of applications which may be enabled or greatly improved by high temporal sampling rates. This presents the challenge of maximizing the use of this new wealth of captured data while continuing to output no more data than before.

          The first critical application of a high-rate DPS achieved low-noise high dynamic range capture through multiple exposure [1]. The system employed the high sampling rate of the DPS to capture multiple exposures in the time previously required to digitize a single exposure. However, while multi-exposure high dynamic range capture traditionally yields dramatically lowered noise compared to single exposure approaches, the multi-exposure approach is inherently limited by the need to maintain alignment between multiple sequential exposures. In real-world applications, this can require problematically short integration times, which in turn yield extremely noisy images.

          Additionally, increased temporal sampling density has been shown to greatly improve the accuracy of Optical Flow Estimation (OFE). The Lucas-Kanade gradient-based approach scales uniquely well with temporal sampling frequency. The precision of the solution increases proportionally to the sampling frequency, while the computational intensity of the solution is linear in the number of samples. SukHwan Lim recently demonstrated an improved Lucas-Kanade approach designed for such high sampling frequencies [4]. His custom, iterative implementation achieves approximately twice the accuracy of the conventional Lucas-Kanade approach with similar computational intensity. Lim proposes a feasible in-sensor hardware implementation of his highly precise OFE engine in 130nm silicon which would become simple and straightforward by the 100nm feature-size generation.
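          To make the gradient-based idea concrete, here is a toy one-dimensional flow estimate in the Lucas-Kanade spirit, not Lim's iterative method: it solves the least-squares system v·Ix = -It over a window. With denser temporal sampling, the temporal derivative spans a shorter interval, so the linearization error that limits the estimate shrinks. All names here are illustrative.

```python
def lk_flow_1d(frame0, frame1, lo, hi):
    """Estimate one displacement (pixels per frame interval) for the
    window [lo, hi) from spatial and temporal derivatives."""
    num = den = 0.0
    for x in range(lo, hi):
        ix = (frame0[x + 1] - frame0[x - 1]) / 2.0  # central spatial gradient
        it = frame1[x] - frame0[x]                  # temporal derivative
        num += -ix * it        # accumulate the normal equation terms
        den += ix * ix
    return num / den if den else 0.0
```

For a linear intensity ramp shifted by exactly one pixel between frames, the estimate recovers the shift exactly; for real images the estimate is only accurate while the motion stays within the linear range of the gradients, which is precisely what high temporal sampling rates guarantee.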

          As Lim shows, we are able to achieve very accurate OFE with significant integrated logic real estate to spare on a 100nm silicon CMOS DPS. Utilizing some of this additional real estate to implement an image warping engine, an accurate optical flow estimate can then be applied to the problem of maximizing the utility of our high frequency data. Using a Lucas-Kanade OFE across a multi-exposure time series, I am able to accurately estimate the flow between sequential images of similar exposure. I then use conventional methods to warp sequential exposures along their respective flows to a common sample time, registering the images. From these registered exposures I reconstruct a single high dynamic range image using a conventional multi-exposure approach, and output the data in a compact 32-bit RGBE format.
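          The final combination and packing steps might be sketched as follows. The simple inverse-exposure average over registered, unsaturated samples is an assumed stand-in for the conventional multi-exposure reconstruction, and the RGBE packing follows the usual shared-exponent convention (three 8-bit mantissas sharing one 8-bit exponent biased by 128).

```python
import math

def combine_exposures(samples):
    """samples: list of (value, exposure_time) pairs for one registered
    pixel, saturated samples already excluded; returns a radiance estimate
    as the average of the per-exposure estimates value / exposure_time."""
    return sum(v / t for v, t in samples) / len(samples)

def to_rgbe(r, g, b):
    """Pack linear RGB radiance into four bytes: three 8-bit mantissas
    sharing one 8-bit exponent, exponent biased by 128."""
    m = max(r, g, b)
    if m < 1e-38:                     # effectively black: special-case zero
        return (0, 0, 0, 0)
    e = math.frexp(m)[1]              # m = f * 2**e with 0.5 <= f < 1
    scale = 256.0 / 2 ** e            # mantissa of the max lands in [128, 256)
    return (int(r * scale), int(g * scale), int(b * scale), e + 128)
```

Decoding inverts the packing as roughly (mantissa / 256) * 2^(exponent - 128), so the 32-bit pixel spans a far wider radiance range than three 8-bit channels could.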

          In synthetic tests, this method achieves high quality, high dynamic range images [fig. 4]. Some data is lost in the reconstruction of the temporally interpolated exposure images. This loss is visible in a slight blur in high detail regions. However, it is largely insignificant in the reconstruction of high dynamic range. The error represents a slight loss of spatial resolution, but induces little to no loss in the accuracy of the high dynamic range reconstruction as occurs with improperly registered exposures. The error would grow as the OFE lost accuracy in fast-moving scenes. However, the quality and motion tolerance of the registration could be improved further by using the iterative extension to Lucas-Kanade proposed by Lim. The flows from multiple exposures could be compared to produce a dramatically more accurate sub-frame final flow estimate, improved proportionally to the number of exposure samples per output frame.

          While not conventional compression in the strict sense, these results correspond to compression of high frame-rate data: by acquiring an accurate high dynamic range image, the method compresses multiple 24-bit low dynamic range exposures into a single 32-bit high dynamic range output. Similarly, high precision optical flow estimates can be output alongside standard frame-rate video to further maximize the utility of high frame-rate capture.

          I began my work intending to experiment with sensor technologies through an elaborate and general sensor simulation framework. Aside from representing an excessively large task in and of itself, this project led me down a path of excessive unrelated debugging and development. As my initial implementation progressed, I became increasingly aware of undocumented missing or broken features in the ray-tracer framework with which I was working, LRT. I wasted significant unnecessary time attempting to debug the system when basic features which appeared to be implemented turned out to be deprecated or broken without notice.

          As it became clear that I was wasting time unnecessarily on this overly complex system, I elected to employ ad hoc simulations of the various algorithms which I wished to explore. However, I was then faced with the surprising challenge of collecting or synthesizing useful data. When addressing sampling of complex scenes, it is necessary to work with realistically complex data in order to have any sense of the real-world results which might be produced by the sensor designs I explored. As realistic image synthesis continues to be a central challenge in the field of computer graphics, such realistic data proved extremely time consuming to make and to render, especially at such high frame-rates as those required for some of my experiments.

          All of the data which I was able to use -- photographic or synthetic -- was my own work, aside from the geometry of the architectural scene. Creating complex and interesting test scenes required a very large amount of work, even from the existing architectural model. Further, some of the data which I created had to represent a very high dynamic range scene in order to effectively test high dynamic range reconstruction. Setting up such a realistic, detailed, and high dynamic range scene took a few days of work, even from an existing model. Rendering the scene proved to be similarly challenging and time-consuming, requiring days to render sufficiently dense sequences, only to require several days of re-rendering after a correction had to be made to the data.

          I had hoped to conclude my work with real-world data from a cutting-edge DPS, the current El Gamal high frame-rate sensor. However, this proved challenging more for reasons of bureaucratic disorganization than anything technical. Every member of the group which developed the sensor had a different notion of why it would or would not work for my various applications, leading to constant vacillation about actually acquiring data for my project. When I did finally set aside a half-day to capture data, I was told upon arrival that the camera system was broken and would be unavailable through the end of the year. Generally indicative of my experiences throughout my attempts to capture real-world data, the individuals responsible for the camera system knew of the failure a week before my day of shooting, but failed to tell me before I actually arrived in the lab.

          Beyond the challenges I faced in the synthesis and acquisition of useful, sufficiently realistic data, I was set back for some time by my own misconceptions of image sensor design and integration. As with others in the graphics lab, I conceived of integration as the incorporation of logic into the pixel. I failed to realize the flexibility afforded by on-chip computational elements which resided outside of the image plane, like the processing elements in a conventional CMOS product. This misconception led me down a number of interesting paths which ultimately proved to be irrelevant and unfounded in the real world.

          Finally, while I was able to design useful adaptive sampling systems which generated moderately interesting results, above all I discovered through my investigation into their feasibility that they may have no useful application. While both the spatial and temporal sampling methods I explored are feasible to implement in hardware, they will likely be surpassed by full-fledged in-sensor image and video compression before they become useful or feasible.

          However, the concept of optimal off-sensor bandwidth usage in the face of ever-increasing volumes of image data found great application in my seemingly tangential registered high dynamic range capture method. It was the practical shortcomings of the strict adaptive sampling methods I explored, and the sensor technology which I learned in exploring them, which led me to rethink my over-arching goal to that of generally efficient bandwidth utilization amid a swelling sea of image data.

          For their tremendous insight and willingness to help me in my relative ignorance of electrical engineering, I am grateful to Ting Chen and SukHwan Lim of the Stanford Image Sensor Lab. For initially advising me on image sensor technology, I thank Bennett Wilburn. And finally for bringing me through two rounds of project advice, and for teaching me so much about cameras and photography, I thank Steve Marschner and Marc Levoy.