Over
the past half-decade, the world has witnessed an explosion of digital
photography. Ever-increasing sensor resolutions continue to drive
the functionality and image quality of digital cameras past their
film-based predecessors in every market they have served. The commoditization
of CMOS image sensors has promised to bring Moore's Law to
image sensor resolution. Advances in the development of Digital
Pixel Sensors (DPS), such as the high dynamic range and high frame-rate
sensors of the Stanford Image Sensor Laboratory [1,2],
promise further explosions of data in a new era of digital sensor
technology, no longer limited by Analog-to-Digital Conversion (ADC).
However, as sensor resolution grows with CMOS process density, and
as new, all-digital sensors enable an entire new class of data to
be captured, sensor data capture rates promise to rapidly surpass
interconnect bandwidth much as CMOS logic systems have outstripped
the memory and communication systems which support them.
While
DPS systems efficiently digitize images at the pixel level, full
camera systems still require a suite of separate image processing
chips to perform demosaicing, tone-mapping, and compression to generate
a final output image. The processing capacity of this image processing
pipeline, like the capture capabilities of the sensor, will grow
rapidly in accordance with Moore's Law. Thus, the greatly
increased data capture capabilities of the modern DPS threaten to
make the bandwidth of communication between chips in this system
the sole limitation of camera performance, where ADC and image processing
were the key limitations of the past. As is well known from the
world of conventional computer systems, interconnect performance
scales with memory performance, at roughly the rate of industrial growth,
many times slower than the growth described by
Moore's Law. In such a world, the capabilities of camera systems
would be limited to growing at the pace of conventional
industry, taking little to no advantage of the rapidly scaling technologies
of DPS.
We
look, then, to a new capability of DPS to alleviate chip-to-chip
bandwidth limitations through a more efficient use of the interconnects.
CMOS Digital Pixel Sensors act, and are fabricated, as SRAMs. As such,
they can be integrated with conventional CMOS logic circuits.
Indeed, the DPS design itself relies on a substantial degree of
CMOS logic in each sensor element to implement per-pixel ADC. The
first DPS constructed employed an 8-bit comparator and an 8-bit
SRAM per-pixel, for a total of 27 transistors per-pixel, compared
to 3 in a conventional analog CMOS sensor [1].
As CMOS technology scales, the logic capacity of the DPS chip will
grow accordingly. Future systems will be able to employ this on-chip
logic to minimize chip-to-chip communications by efficiently sampling
and selectively transferring the sensor's ever-expanding mass
of captured data.
Inspired
by the possibilities for random pixel read-out in DPS, I set out
to investigate potential additions to DPS architectures to maximize
data efficiency. I began with the goal of studying possible DPS
architectures through a sensor simulator which I would construct
in a ray-tracer, LRT. This goal, however, proved excessively challenging
without maintaining a sufficient focus on issues of digital photography.
I chose, then, to focus directly on the exploration of sensor designs
through ad-hoc simulations, rather than an excessively
general sensor simulator.
In
this paper, I present three applications of DPS integrated logic
to efficient off-chip interconnect utilization. These applications
address scaling still image resolution, video resolution, and dynamic
range. For each I describe a logical foundation, detail the complexities
of potential hardware implementations, and present simulated results.
I show the successes of each, and attempt to illustrate where they
might fall short in real-world applications.
I
first set out to investigate whether an adaptive sampling methodology
could be applied to efficient sampling of high resolution still
images. Many traditional computer graphics algorithms exploit knowledge
of signal reconstruction to sample regions of fine detail
more densely than regions with little detail. Such a methodology
of adapting sampling frequency to local detail seems ideally suited
to efficient data use in high resolution digital image capture.
I
first conceived of a DPS design which achieves adaptive sampling
through local comparisons. The pixel comparison network consists
of a conventional pixel array extended to allow pixels to compare
themselves to their local neighborhoods. In this design, groups
of pixels collaborate, comparing themselves to one another, to determine
which pixels need to be read out for maximum image integrity. A
simple implementation would allow only small, regional comparisons,
while a more advanced implementation might employ a hierarchical
network of comparisons to characterize the sampling of the image
at multiple frequencies.
Such
a network seems, at first, simple to implement. Its core operation
is the comparison, logic already implemented per-pixel in any DPS.
It is built on pixel-level operations which can easily be embedded
in-pixel with contemporary process technology. However, the construction
of an effective pixel comparison network, requiring multiple levels
of detail, is severely hindered by the myriad lines of communication
necessary for in-pixel multiple level-of-detail comparisons. Further,
the random pixel order of the adaptively sampled output stream is
potentially difficult to organize efficiently without requiring
excessive address information per-pixel, potentially killing all
data efficiency gained through adaptive sampling.
The
fundamental flaw of the comparison network is its approach to logic
integration. While per-pixel logic is ideal for scalable systems,
the design betrays my misconception that CMOS integrated logic is
best implemented physically in-pixel. With fast pixel-level ADC,
DPS are digitally addressable at fast SRAM speeds. Because the DPS
can perform ADC and temporary storage in-pixel, any additional logic
can be constructed and interconnected in a pixel-parallel fashion
yet reside outside the active pixel sensor area on-die, acting on
the captured image as on a memory.
Conventional
graphics suggests an alternative implementation. In this approach,
a Gaussian pyramid is employed to represent multiple levels of detail,
and data from this pyramid is sampled from low to high resolution
as necessary for each pixel. First, the image is processed into
a multi-resolution pyramid. From this pyramid, samples are scanned
out at each successive level of detail. Higher resolutions are
selectively scanned out wherever lower resolutions are insufficiently
similar to their higher resolution counterparts. The image can then
be reconstructed by interpolating and combining the samples from
progressive levels of detail in the multi-resolution pyramid.
The
multi-resolution data structure can be efficiently represented as
a quadtree rooted at the lowest level of detail [fig.
1]. In this structure, leaf nodes represent the highest level
of detail needed to reasonably represent a given sample or group
of samples. Nodes in the tree can be represented with just 4 bits
of non-pixel data to convey which children exist beneath the given
node. The data can then be easily read out in a fixed order and
reconstructed from just these 4 bits of additional data per-sample.
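As a concrete illustration, this read-out can be modeled in software. The sketch below is a simplified model, not the hardware scan-out: it subdivides wherever a block's range exceeds a threshold, so each node's mask is all-or-none rather than marking individual children, and the function names and threshold test are my own.

```python
import numpy as np

def encode(img, threshold):
    """Depth-first adaptive read-out of a square, power-of-two image.
    Each node emits one sample and a 4-bit mask saying whether its
    four children follow (all-or-none in this simplified sketch)."""
    stream = []

    def visit(x, y, size):
        block = img[y:y + size, x:x + size]
        if size == 1 or block.max() - block.min() <= threshold:
            stream.append((0b0000, block.mean()))   # leaf: no children
            return
        stream.append((0b1111, block.mean()))       # all four children follow
        h = size // 2
        for dy, dx in ((0, 0), (0, h), (h, 0), (h, h)):
            visit(x + dx, y + dy, h)

    visit(0, 0, img.shape[0])
    return stream

def decode(stream, size):
    """Reconstruct by filling each node's region with its sample,
    letting any children that follow refine the region."""
    out = np.zeros((size, size))
    it = iter(stream)

    def build(x, y, size):
        mask, value = next(it)
        out[y:y + size, x:x + size] = value
        if mask:
            h = size // 2
            for dy, dx in ((0, 0), (0, h), (h, 0), (h, h)):
                build(x + dx, y + dy, h)

    build(0, 0, size)
    return out
```

For a flat 4x4 image with a single bright pixel, the stream holds 9 nodes rather than 16 samples, and the varying quadrant is still reconstructed exactly.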
The
Gaussian pyramid adaptive sampler lends itself to a straightforward
hardware implementation. The algorithm requires only an adder and
a comparator to determine whether the difference between a sample
and its approximation is within an acceptable threshold for the
sample to be ignored, as well as a separate unit to compute the
multi-resolution pyramid. The comparator and adder combined require
fewer than 50 transistors, while the multi-resolution pyramid can
be constructed using a conventional SIMD processor array. The SIMD
processor array need only perform up to one 4-to-1 pixel reduction,
each consisting of three adds and one multiply, for every three
sensor pixels. For a hypothetical 4 megapixel sensor, this corresponds
to 4 million adds and 1 million multiplies, easily accomplished
in a short period of time with a relatively small fixed-point SIMD
array of just a few hundred thousand logic transistors. This can
be implemented easily in 180nm bulk silicon with a relatively insignificant
design effort.
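The pyramid construction itself is compact to express. The following is a software model of the reduction described above, with the three adds and one multiply (by 0.25) of each 4-to-1 reduction appearing explicitly:

```python
import numpy as np

def build_pyramid(img):
    """Build the multi-resolution pyramid bottom-up over a square,
    power-of-two image. Each 2x2 block reduces to one pixel via
    three adds and one multiply, as in the SIMD reduction described."""
    levels = [img.astype(float)]
    while levels[-1].shape[0] > 1:
        p = levels[-1]
        levels.append((p[0::2, 0::2] + p[0::2, 1::2]
                       + p[1::2, 0::2] + p[1::2, 1::2]) * 0.25)
    return levels   # levels[0] is full resolution, levels[-1] a 1x1 mean
```

Summed over all levels, an N-pixel image requires about N/3 such reductions, which recovers the operation counts above: roughly 3N/3 = N adds and N/3 multiplies.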
Adaptive
sampling of output data represents a form of simple on-chip image
compression [fig. 2]. As
such, we can analyze its effectiveness as a lossy image codec. Using
just this simple method, I was able to achieve visually acceptable
results at a little over 50% data rate. As desired, the adaptively
sampled images maintain high detail while requiring proportionally
less data for their representation. Difference images show that
detail is preserved in regions of fine structure, while much of the
difference appears as noise. Detailed regions, such as the archways
and windows in the atrium scene, are maintained, while noisy high
frequencies within a small threshold are discarded.
While
these results are encouraging, and do allow for as much as a 50%
improvement in bandwidth utilization off-chip from the image sensor,
they suggest a more dramatic near-future evolution: the movement
to formal on-chip image compression. With hardware JPEG codecs occupying
ever-smaller die real estate, final image compression on-chip becomes
an increasing possibility. Before such measures as spatially adaptive
sampling are necessary for still image bandwidth reduction, JPEG
may already be available on-chip.
For
on-chip JPEG to be possible in color sensors, demosaicing must be
performed on-chip. However, as sensor resolution outstrips image
resolution, this becomes increasingly straightforward. Where patented
black magic was once required to demosaic color images effectively,
increasingly optically blurry images may actually enable a return
to simple interpolation for demosaicing.
While
increasing resolution presents potential problems for low-cost,
high resolution still camera systems, high resolution sensors present
an explosion of data in video applications. 5 megapixel sensors
capture 15 MB frames at 8 bits per channel. While this may be manageable
at still camera capture rates, it represents nearly 1 GB/s at
60 Hz. Even still cameras gain substantial flexibility through continuous
capture, broadening the application of high resolution, high frame-rate
capture beyond video alone. Thus we are led to consider adaptive
sampling in the temporal domain.
Prior
graphics research, such as the Frameless Rendering methodology proposed
by Bishop, et al. [3],
has suggested that the progressive update of images over time can
maintain acceptable image quality. We look, then, to mirror
this frameless presentation of time-varying data in video applications.
Such
an approach can be implemented effectively as a selective read-out
system. Pixels are chosen for update based on their degree of change,
and all pixels within a fixed budget are updated for each captured
frame. For every frame, pixels are compared to their last output
value. Based on their difference from their last known representation,
they are enqueued for update. A client sits on the queue and dequeues
as many pixels for update as it is able to in the span of one frame's
capture.
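One frame-step of this selective read-out can be sketched as follows. This is a hypothetical software model: the pixel store, budget parameter, and function name are my own, and a real implementation would use a hardware priority queue rather than a software heap.

```python
import heapq

def update_frame(captured, last_output, budget):
    """Select the `budget` pixels with the largest change since their
    last output value, and update only those (frameless read-out).
    `captured` and `last_output` are flat lists of pixel values."""
    heap = []
    for idx, value in enumerate(captured):
        diff = abs(value - last_output[idx])
        heapq.heappush(heap, (-diff, idx, value))   # max-heap by change
    updated = []
    for _ in range(min(budget, len(heap))):
        _, idx, value = heapq.heappop(heap)
        last_output[idx] = value                    # dequeue and update
        updated.append(idx)
    return updated
```

The client's dequeue rate appears here as `budget`: however many pixels it can absorb during one frame's capture.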
This
algorithm can be implemented with an adder and a priority queue. The
adder calculates the difference between each pixel and its last
output value at every frame. Each pixel is then priority-enqueued
according to this difference. The logic
required for the per-pixel adder is negligible, but the priority
queue is potentially challenging to implement on-chip. A 5 megapixel
sensor at 60 Hz generates 300 million pixels/s which must be sorted
by the priority queue.
In
scenes with varying degrees of change, the queue load can be significantly
alleviated using a simple threshold implemented using a single comparator.
All pixel changes below the threshold are discarded to minimize
the queue size. The image quality change should be negligible, as
the threshold should eliminate only those pixels which would not
have been dequeued within the given update budget. Still, while
a simple threshold can greatly reduce the average size of the queue,
it cannot guarantee a maximum queue size.
To
achieve a consistently maintainable level of performance, the image
can be tiled into regions which are served by local queues. The
local queues have a guaranteed minimum performance corresponding
to a sort over all pixels in their tile, rather than all pixels in
the image. This tiled design offers the further advantage of scalability.
Rather than relying on a monolithic queue to sort all pixels in
the image, it employs smaller, fixed-size elements which can be
replicated to serve more tiles of similar size in higher resolution
image sensors.
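The tiled variant, with the threshold pre-filter folded in, can be sketched as below. The tile size, per-tile budget, and threshold are illustrative parameters of my own choosing, over a flat pixel array for simplicity.

```python
import heapq

def tiled_update(captured, last_output, tile_size, per_tile_budget, threshold):
    """Each tile sorts only its own pixels, bounding the worst-case
    queue load; changes below `threshold` are discarded before
    enqueueing to keep the local queues small."""
    n = len(captured)
    updated = []
    for start in range(0, n, tile_size):
        heap = []
        for idx in range(start, min(start + tile_size, n)):
            diff = abs(captured[idx] - last_output[idx])
            if diff > threshold:                    # pre-filter small changes
                heapq.heappush(heap, (-diff, idx))
        for _ in range(min(per_tile_budget, len(heap))):
            _, idx = heapq.heappop(heap)
            last_output[idx] = captured[idx]
            updated.append(idx)
    return updated
```

Replicating the loop body per tile is what a scaled sensor would do in hardware: one fixed-size queue element per tile, regardless of total resolution.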
In
scenes with small degrees of change, this approach is able to achieve
3:1 compression of data with acceptable loss in quality [fig.
3]. However, for scenes with more moderate frame-to-frame changes,
it is unable to maintain high image quality at moderate data rates.
Maximum data loss due to the selective update process remains low,
but the approach fails because it produces particularly unpleasant
artifacts such as posterized regions with defined, pixelated edges.
A static enqueue threshold further exaggerates these artifacts by
causing unpleasant streaking in slow-moving low-contrast regions.
While
this frameless video approach to temporally adaptive sampling consistently
achieves at least some savings in all scenes, its savings are not
consistently large enough, at acceptable image quality,
to warrant the complexity of implementation. The architecture of
the approach is straightforward, but the scale of the sorting problem
required to prioritize pixel update is inefficient to implement
in hardware given the modest bandwidth gains achieved in the worst
case.
What
the frameless video concept does suggest, however, is the potential
utility of in-sensor video compression. While the approach might
be useful with lower hardware requirements, the complexity of its
implementation, coupled with its feasibility in near-future CMOS processes,
suggests that in-sensor MPEG compression at high resolutions may
soon be practical.
Further,
while the perceptual quality of the images produced by the frameless
video method degrades rapidly as motion increases and pixel budgets
decrease, the relatively constant mathematical error of the sampling
method – bounded by the pixel update budget – coupled
with the transmission of data only in regions of change may lend
itself to vision systems where perceptual quality is unimportant
and processing power needs to be applied where it is needed most:
where the image is changing.
The
ADC performance of DPS systems is creating an explosion in the frame-rate
of digital photographic systems. Early DPS have been designed specifically
for their high temporal sampling rate. The Stanford Image Sensor
Laboratory has employed this high sampling rate to construct unique
sensors with extremely high dynamic range and video frame-rates
[1]. Others have suggested a range of applications which may be
enabled or greatly improved by high temporal sampling rates. This
then presents the challenge of maximizing use of the new wealth
of data while continuing to output a similar amount.
The
first critical application of a high-rate DPS achieved low-noise
high dynamic range capture through multiple exposure [1].
The system employed the high sampling rate of the DPS to capture
multiple exposures in the time previously required to digitize a
single exposure. However, while multi-exposure high dynamic range
capture traditionally yields dramatically lowered noise compared
to single exposure approaches, the multi-exposure approach is inherently
limited by the need to maintain alignment between multiple sequential
exposures. In real-world applications, this can require problematically
short integration times, which in turn yield extremely noisy images.
Additionally,
increased temporal sampling density has been shown to greatly improve
the accuracy of Optical Flow Estimation (OFE). The Lucas-Kanade
gradient-based approach scales uniquely well with temporal sampling
frequency. The precision of the solution increases proportionally
to the sampling frequency, while the computational intensity of
the solution is linear in the number of samples. SukHwan Lim recently
demonstrated an improved Lucas-Kanade approach designed for such
high sampling frequencies [4].
His custom, iterative implementation achieves approximately twice
the accuracy of the conventional Lucas-Kanade approach with similar
computational intensity. Lim proposes a feasible in-sensor hardware
implementation of his highly precise OFE engine in 130nm silicon,
one which would become straightforward by the 100nm feature-size
generation.
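The basic gradient-based estimate underlying these approaches can be sketched as a single least-squares solve over one window. This is a textbook Lucas-Kanade step, not Lim's iterative engine, and the function name is mine:

```python
import numpy as np

def lucas_kanade(frame0, frame1):
    """Estimate one (vx, vy) for a window by solving the brightness
    constancy system Ix*vx + Iy*vy = -It in least squares over all
    pixels (the gradient-based Lucas-Kanade estimate)."""
    f0 = frame0.astype(float)
    Iy, Ix = np.gradient(f0)                    # spatial gradients
    It = frame1.astype(float) - f0              # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    v, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return v                                    # (vx, vy)
```

Because `It` shrinks as the inter-frame interval shrinks, the linearization underlying this solve becomes more accurate at higher temporal sampling rates, which is the scaling property noted above.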
As
Lim shows, we are able to achieve very accurate OFE with significant
integrated logic real estate to spare on a 100nm silicon CMOS DPS.
Utilizing some additional real estate to implement an image warping
engine, our accurate optical flow estimate can then be applied to
the problem of maximizing the utility of our high frequency data.
Utilizing a Lucas-Kanade OFE across a multi-exposure time series,
I am able to accurately estimate the flow between sequential images
of similar exposure. I then use conventional methods to warp sequential
exposures along their respective flows to a common sample time to
register the images. From this reconstructed common time I can reconstruct
a single high dynamic range exposure using a conventional multi-exposure
approach, and output the data in a compact 32-bit RGBE format.
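The RGBE packing at the end of this pipeline is compact to express: 8-bit red, green, and blue mantissas share a single 8-bit exponent (Greg Ward's format). A sketch, assuming linear RGB input:

```python
import math

def float_to_rgbe(r, g, b):
    """Pack linear RGB into 4 bytes: 8-bit mantissas sharing one
    8-bit excess-128 exponent taken from the largest channel."""
    v = max(r, g, b)
    if v < 1e-32:
        return (0, 0, 0, 0)
    m, e = math.frexp(v)              # v = m * 2**e, with 0.5 <= m < 1
    scale = m * 256.0 / v
    return (int(r * scale), int(g * scale), int(b * scale), e + 128)

def rgbe_to_float(r, g, b, e):
    """Unpack 4 RGBE bytes back to linear RGB."""
    if e == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, e - 136)      # 2**(e - 128) / 256
    return (r * f, g * f, b * f)
```

The shared exponent is what lets four bytes cover the dynamic range that would otherwise require three floating-point channels, at the cost of relative precision in the dimmer channels.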
In
synthetic tests, this method achieves high quality, high dynamic
range images [fig. 4]. Some data is lost
in the reconstruction of the temporally interpolated exposure images.
This loss is visible in a slight blur in high detail regions. However,
it is largely insignificant in the reconstruction of high dynamic
range. The error represents a slight loss of spatial resolution,
but induces little to no loss in the accuracy of the high dynamic
range reconstruction, unlike the loss that occurs with improperly
registered exposures.
The error would grow as the OFE lost accuracy in fast-moving scenes.
However, the quality and motion tolerance of the registration could
be improved further by using the iterative extension to Lucas-Kanade
proposed by Lim. The flows from multiple exposures could be compared
to produce a dramatically more accurate sub-frame final flow estimate,
improved proportionally to the number of exposure samples per output
frame.
While
not clearly conventional compression, these results can be seen
to correspond to compression of high frame-rate data. By acquiring
accurate high dynamic range, the method effectively compresses multiple
24-bit low dynamic range exposures into a single 32-bit high dynamic
range output. Similarly, high precision optical flow estimates can
be output alongside standard frame-rate video to further maximize
the utility of high frame-rate capture.
I
began my work intending to experiment with sensor technologies through
an elaborate and general sensor simulation framework. Aside from
representing an excessively large task in and of itself, this project
led me down a path of excessive unrelated debugging and development.
As my initial implementation progressed, I became increasingly aware
of undocumented missing or broken features in the ray-tracer framework
with which I was working, LRT. I wasted significant unnecessary
time attempting to debug the system when basic features which appeared
to be implemented turned out to be deprecated or broken without
notice.
As
it became clear that I was wasting time unnecessarily on this overly
complex system, I elected to employ ad hoc simulations of the various
algorithms which I wished to explore. However, I was then faced
with the surprising challenge of collecting or synthesizing useful
data. When addressing sampling of complex scenes, it is necessary
to work with realistically complex data in order to have any sense
of the real-world results which might be produced by the sensor
designs I explored. As realistic image synthesis continues to be
a central challenge in the field of computer graphics, such realistic
data proved extremely time consuming to make and to render, especially
at such high frame-rates as those required for some of my experiments.
All of the data which I was able to use, photographic or synthetic,
was my own work, aside from the geometry of the architectural
scene. Creating complex and interesting test scenes required a very
large amount of work, even from the existing architectural model.
Further, some of the data which I created had to represent a very
high dynamic range scene in order to effectively test high dynamic
range reconstruction. Setting up such a realistic, detailed, and
high dynamic range scene took a few days of work, even from an existing
model. Rendering the scene proved to be similarly challenging and
time-consuming, requiring days to render sufficiently dense sequences,
only to require several days of re-rendering after a correction
had to be made to the data.
I had
hoped to conclude my work with real-world data from a cutting-edge
DPS, the current El Gamal high frame-rate sensor. However, this
proved challenging more for reasons of bureaucratic disorganization
than anything technical. Every member of the group which developed
the sensor had a different notion of why it would or would not work
for my various applications, leading to constant vacillation about
actually acquiring data for my project. When I did finally set aside
a half-day to capture data, I was told upon arrival that the camera
system was broken and would be unavailable through the end of the
year. Generally indicative of my experiences throughout my attempts
to capture real-world data, the individuals responsible for the
camera system knew of the failure a week before my day of shooting,
but failed to tell me before I actually arrived in the lab.
Beyond
the challenges I faced in the synthesis and acquisition of useful,
sufficiently realistic data, I was set back for some time by my
own misconceptions of image sensor design and integration. As with
others in the graphics lab, I conceived of integration as the incorporation
of logic into the pixel. I failed to realize the flexibility afforded
by on-chip computational elements which resided outside of the image
plane, like the processing elements in a conventional CMOS product.
This misconception led me down a number of interesting paths which
ultimately proved to be irrelevant and unfounded in the real world.
Finally,
while I was able to design useful adaptive sampling systems which
generated moderately interesting results, above all I discovered
through my investigation into their feasibility that they may have
no useful application. While both the spatial and temporal sampling
methods I explored are feasible to implement in hardware, they will
likely be surpassed by full-fledged in-sensor image and video compression
before they become useful or feasible.
However,
the concept of optimal off-sensor bandwidth usage in the face of
ever-increasing volumes of image data found great application in
my seemingly tangential registered high dynamic range capture method.
It was the practical shortcomings of the strict adaptive sampling
methods I explored, and the sensor technology which I learned in
exploring them, which led me to rethink my over-arching goal to
that of generally efficient bandwidth utilization amid a swelling
sea of image data.
For
their tremendous insight and willingness to help me in my relative
ignorance of electrical engineering, I am grateful to Ting Chen
and SukHwan Lim of the Stanford Image Sensor Lab. For initially
advising me on image sensor technology, I thank Bennett Wilburn.
And finally for bringing me through two rounds of project advice,
and for teaching me so much about cameras and photography, I thank
Steve Marschner and Marc Levoy.