Lecture 02.1 – Image Formation
Lecture Slides: Image Formation
Image formation is the process by which a three-dimensional (3D) scene is transformed into a two-dimensional (2D) image. Understanding this process is fundamental for both interpreting images (as in computer vision) and synthesizing images (as in computer graphics). At its core, image formation involves projection – the mapping of 3D points in the world to 2D points on an image plane. This projection is governed by the physics of light and the geometry of the imaging system.
1. Historical Developments in Image Formation
Understanding how images are formed has been a pursuit spanning centuries, intertwining physics, art, and mathematics.
Ancient and Medieval Optics – Camera Obscura
The basic principle of image formation by a pinhole (camera obscura) was known in antiquity. Early records suggest that around the 5th century B.C., the Chinese philosopher Mozi and later Aristotle were aware of the camera obscura effect, where light from a scene passing through a small hole projects an inverted image on a surface. However, it was the medieval Arab scientist Ibn al-Haytham (Alhazen, 965–1040 AD) who provided the first detailed description. In his Book of Optics (~1021 AD), Ibn al-Haytham used the camera obscura to study light and vision, introducing the concept that vision results from light entering the eye rather than emanating from it.
Renaissance Perspective and Geometry
A major leap occurred in the early 15th century with the formalization of linear perspective. Artists and architects sought a systematic way to represent 3D scenes on 2D canvases realistically. Filippo Brunelleschi is credited with demonstrating perspective around 1413. By observing a scene through a small hole in a painted panel and reflecting the view in a mirror, he proved that parallel lines in the world converge to a vanishing point on the horizon line in the painting.
Shortly thereafter, Leon Battista Alberti published On Painting (1435), providing a more formal framework. This framework used the concept of a visual pyramid and an intersection of rays on a picture plane. Artists like Masaccio and Leonardo da Vinci refined these techniques, and projective geometry later gave perspective a rigorous mathematical underpinning.
Early Cameras and Photographic Imaging
The camera obscura remained a tool for artists into the Renaissance, often enhanced with lenses for brightness. The leap to permanent image capture came in 1826 when Joseph Nicéphore Niépce produced the first known photograph on a pewter plate (heliography). Although exposures were extremely long, it proved that images formed by a camera obscura could be preserved chemically.
Subsequent improvements by Louis Daguerre (daguerreotype, 1839) and William Henry Fox Talbot (calotype) reduced exposure times and introduced practical photographic processes. Throughout the 19th century, optical scientists refined lens design to reduce aberrations, enabling sharper images.
Modern Developments
In the 20th century, photogrammetry and later computer vision formalized the pinhole camera as a standard mathematical model of image formation. Early work in the 1960s introduced matrix methods and homogeneous coordinates to handle perspective projection in computers.
From the late 20th century onward, camera sensors shifted from film to digital (CCD, CMOS). With images as digital data, computational photography emerged. These developments reflect the continued evolution of image formation from a purely geometric/optical process to one in which significant software post-processing complements the hardware.
2. The Pinhole Camera Model
A pinhole camera is essentially a light-tight box with a small aperture (pinhole) through which light from a scene projects onto the opposite side. This simple setup captures the fundamental mapping from 3D to 2D. Real cameras with lenses behave similarly (lenses just gather more light), so the pinhole model remains valid as the base geometric framework.
Coordinate Setup
We place the pinhole (center of projection) at the origin of a 3D coordinate system. The optical axis aligns with the \(Z\)-axis, and the image plane is at \(Z = f > 0\), perpendicular to the axis. A point \(P = (X, Y, Z)\) in 3D is projected to \((x, y)\) on the image plane by drawing the ray from \((0,0,0)\) through \((X,Y,Z)\). Where that ray meets the plane \(Z=f\), we get the image coordinates. By similar triangles:
\[x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.\]
Thus, perspective projection is defined by dividing by the depth \(Z\).
Proof by Similar Triangles
Consider the line from \((0,0,0)\) through \((X, Y, Z)\). In parametric form, it is:
\[(x(t),\, y(t),\, z(t)) = t\,(X,\, Y,\, Z), \qquad t \in \mathbb{R}.\]
We want the intersection with the plane \(z = f\). Imposing \(tZ = f \Rightarrow t = f/Z\). Hence:
\[\Big(\frac{fX}{Z},\ \frac{fY}{Z},\ f\Big).\]
Dropping the constant third component \(z = f\), we obtain \(x = fX/Z\) and \(y = fY/Z\).
Numerical Example
Let \(f = 1\,\mathrm{(unit)}\), and a 3D point \((X,Y,Z)=(2,3,10)\). Then:
\[x = \frac{1 \cdot 2}{10} = 0.2, \qquad y = \frac{1 \cdot 3}{10} = 0.3.\]
If we change \(Z\) to \(5\), the image point doubles to \((0.4, 0.6)\), illustrating that objects closer to the pinhole appear larger.
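As a quick sanity check, here is a tiny Python sketch of the projection above (same focal length and points; nothing beyond the formula \(x = fX/Z\), \(y = fY/Z\)):
def project_pinhole(point, f=1.0):
    # Perspective projection: scale by the focal length f and divide by depth Z.
    X, Y, Z = point
    return f * X / Z, f * Y / Z
print(project_pinhole((2.0, 3.0, 10.0)))  # (0.2, 0.3)
print(project_pinhole((2.0, 3.0, 5.0)))   # (0.4, 0.6) – closer, hence larger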
Inadequacy of a Simple Pinhole
In its idealized geometric form, the pinhole camera assumes an infinitesimally small hole, which yields a perfectly sharp projection. In practice, a very tiny pinhole admits very little light, yielding a dim image. If one enlarges the pinhole to admit more light, the image blurs because rays from different scene points begin to overlap on the image plane.
This inherent trade-off (sharpness vs. brightness) is why real cameras use lenses rather than tiny pinholes. A lens of finite aperture can gather more light while still focusing rays from a given scene point to one image point.
3. Camera Intrinsics and the Projection Matrix
So far, we have \((x, y)\) in a continuous coordinate system. Real cameras record pixels, with an origin usually at the image’s corner and a certain pixel size. Let \((u,v)\) be pixel coordinates. We define:
\(s_x, s_y\): number of pixels per physical unit (e.g., mm) along \(x,y\).
\((c_x, c_y)\): location in pixel coordinates where the optical axis intersects the image plane (principal point).
\(\alpha\): skew parameter (usually \(0\) for standard cameras).
Thus:
\[u = s_x\, f\,\frac{X}{Z} + c_x, \qquad v = s_y\, f\,\frac{Y}{Z} + c_y,\]
with an optional skew term \(\alpha\). These parameters form the \(3 \times 3\) intrinsic matrix \(K\). For example (zero skew):
\[K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix},\]
where \(f_x = s_x\,f\) and \(f_y = s_y\,f\). In homogeneous coordinates, the 2D point \((u,v,1)^T\) is \(K\, (X,Y,Z)^T / Z\).
Extrinsic Parameters
A camera also has a position (translation \(\mathbf{t}\)) and orientation (rotation \(R\)) relative to a world coordinate system. Hence, a point in world coordinates \(\mathbf{X}_w = (X_w, Y_w, Z_w)\) first transforms to camera coordinates:
\[\mathbf{X}_c = R\,\mathbf{X}_w + \mathbf{t},\]
and then projects via:
\[x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}.\]
Combining with the intrinsic matrix, the camera projection matrix \(P\) is the \(3 \times 4\) matrix:
\[P = K\,[\,R \mid \mathbf{t}\,].\]
Thus, for a homogeneous world coordinate \(X_w\), its image in homogeneous pixel coordinates is \(x_{image} = P \, X_w\).
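To make the matrix form concrete, the following minimal sketch assembles \(P = K[R \mid \mathbf{t}]\) and projects a homogeneous world point; the specific numbers (focal length in pixels, a 30° rotation about the \(Y\)-axis, the translation, and the point) are illustrative assumptions, not values from the lecture.
import numpy as np
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
# Assumed extrinsics: rotation of 30 degrees about the Y-axis plus a translation.
theta = np.deg2rad(30)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.2, 2.0])
P = K @ np.hstack([R, t[:, None]])    # 3x4 projection matrix P = K [R | t]
X_w = np.array([0.5, 0.3, 4.0, 1.0])  # homogeneous world point
x_h = P @ X_w                         # homogeneous pixel coordinates
u, v = x_h[:2] / x_h[2]               # perspective division
print(u, v)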
Full Projection Example
Consider a simple scenario where the world coordinate system coincides with the camera coordinate system (so \(R=I\), \(t=0\)), the focal length is \(f=1\) (normalized units), and the image sensor has 1000 pixels per unit with principal point at the center.
Then the intrinsic matrix is:
\[K = \begin{pmatrix} 1000 & 0 & c_x \\ 0 & 1000 & c_y \\ 0 & 0 & 1 \end{pmatrix},\]
where \((c_x, c_y)\) is the pixel location of the principal point (the image center).
For our 3D point \((X, Y, Z) = (2, 3, 10)\), the projection in normalized sensor units was \((0.2, 0.3)\). Converting to pixel coordinates:
\[u = 1000 \cdot 0.2 + c_x = 200 + c_x, \qquad v = 1000 \cdot 0.3 + c_y = 300 + c_y,\]
i.e., the point lands 200 pixels from the principal point along \(u\) and 300 pixels along \(v\).
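A quick numeric check of this worked example, assuming for illustration a principal point at \((c_x, c_y) = (500, 500)\):
f, s = 1.0, 1000.0       # focal length (normalized units) and pixels per unit
cx, cy = 500.0, 500.0    # assumed principal point (not specified in the text)
X, Y, Z = 2.0, 3.0, 10.0
x_n, y_n = f * X / Z, f * Y / Z    # normalized coordinates: (0.2, 0.3)
u, v = s * x_n + cx, s * y_n + cy  # pixel coordinates: (700.0, 800.0)
print(x_n, y_n, u, v)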
4. Image Distortions & Correction
Real lenses introduce radial distortion (barrel/pincushion) and possibly tangential distortion. A common radial model maps ideal (undistorted) normalized coordinates \((x, y)\) to distorted coordinates \((x_d, y_d)\):
\[x_d = x\,(1 + k_1 r^2 + k_2 r^4), \qquad y_d = y\,(1 + k_1 r^2 + k_2 r^4),\]
where \(r^2 = x^2 + y^2\). Tangential distortion involves additional terms.
These distortions result in two main visible effects:
Barrel distortion: Negative \(k_1\) causes the image to “bulge” outward (straight lines bow outward)
Pincushion distortion: Positive \(k_1\) makes the image “pinch” inward (straight lines bow inward)
Calibration software (e.g., OpenCV) solves for these coefficients, so we can undistort images. This is critical for precision vision applications where accurate geometry is required.
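For illustration, the following sketch applies the radial model above to a grid of normalized image points; the coefficient values are arbitrary assumptions chosen to make the effect visible (negative \(k_1\), i.e., barrel distortion):
import numpy as np
k1, k2 = -0.25, 0.05                  # assumed radial distortion coefficients
# Grid of ideal (undistorted) normalized image coordinates.
x, y = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
r2 = x**2 + y**2
factor = 1 + k1 * r2 + k2 * r2**2     # radial distortion factor
x_d, y_d = x * factor, y * factor     # distorted coordinates
print(np.round(np.dstack([x_d, y_d]), 3))
Calibration toolboxes estimate \(k_1, k_2\) (and tangential terms) from images of a known pattern and then invert this mapping to undistort.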
5. Properties of Perspective Projection
Perspective projection has several important geometric properties:
Straight lines project to straight lines: Except for lines passing through the camera center (which project to a single point, a vanishing point), projective geometry ensures that the image of a line is still a line.
Proof sketch: A 3D line can be parameterized as \(P(t) = P_0 + t\mathbf{d}\), where \(P_0=(X_0,Y_0,Z_0)\) is a point on the line and \(\mathbf{d}=(a,b,c)\) is its direction vector. Projection gives \(x(t) = f\,\frac{X_0 + ta}{Z_0 + tc}\) and \(y(t) = f\,\frac{Y_0 + tb}{Z_0 + tc}\); eliminating \(t\) leaves a single linear equation in \(x\) and \(y\), i.e., a line in the image. Equivalently, the 3D line and the camera center span a plane, and the image of the line is that plane's intersection with the image plane.
Vanishing points: Parallel lines in 3D that are not parallel to the image plane appear to converge at a vanishing point. This point is where a set of parallel lines appears to meet at infinity.
Each family of parallel lines that lies in (or parallel to) the ground plane converges to a point on the horizon line in the image; the horizon line itself is the vanishing line of the ground plane.
Planes to homographies: Points on a plane in 3D project via a 2D projective transformation (homography). Thus, a planar surface can be unwarped from its image if the plane’s equation and camera parameters are known.
Field of View (FOV): The focal length \(f\) and sensor size determine the camera’s field of view. A small \(f\) yields a wide FOV (wide angle), and a large \(f\) yields a narrow FOV (telephoto):
\[\mathrm{FOV} \approx 2\,\arctan\Big(\frac{\text{sensor width}}{2f}\Big).\]
For example, a full-frame camera sensor (~36 mm wide) with a 35 mm focal length lens has \(\mathrm{FOV}_h \approx 2 \arctan(18/35) \approx 54^\circ\). If we increase \(f\) to 70 mm, \(\mathrm{FOV}_h \approx 29^\circ\) – a much tighter view (a short numeric check follows this list of properties).
Depth ambiguity: Single-view projection collapses the depth dimension; many points along the same ray in 3D have the same 2D image. Therefore, recovering 3D from one image is ill-posed without additional priors or multiple views.
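A short numeric check of the field-of-view formula, using the full-frame example above:
import numpy as np
def horizontal_fov_deg(sensor_width_mm, focal_length_mm):
    # FOV = 2 * arctan(sensor_width / (2 f)), converted to degrees.
    return np.degrees(2 * np.arctan(sensor_width_mm / (2 * focal_length_mm)))
print(horizontal_fov_deg(36, 35))  # ~54.4 degrees (wide)
print(horizontal_fov_deg(36, 70))  # ~28.8 degrees (telephoto)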
6. Advanced Theoretical Extensions
While the pinhole model is a cornerstone, many modern imaging systems go beyond it.
Light Field Imaging and Plenoptic Cameras
A light field is a 4D function encoding the intensity of rays in free space, parameterized by position and direction. The concept stems from the plenoptic function of Adelson and Bergen, which includes spatial, angular, spectral, and temporal dimensions of light.
Plenoptic Cameras: By placing a micro-lens array near the sensor (behind a main lens), each micro-lens locally captures light arriving from different incident angles. The recorded data is thus a 4D light field: two coordinates for the micro-lens position and two for the pixel position beneath it, which encodes ray direction. This enables post-capture refocusing, slight viewpoint shifts, and other computational possibilities.
Non-Conventional Imaging Techniques
Compressive Imaging: A single-pixel camera uses a spatial light modulator (e.g., a DMD) to sample the scene with known patterns, measuring the integrated intensity on a single detector. This is useful at wavelengths (such as infrared) where large pixel arrays are difficult or expensive to build.
Non-Line-of-Sight (NLOS) Imaging: By emitting laser pulses and measuring time-of-flight echoes with ultrafast detectors, one can reconstruct 3D shapes hidden around corners. The wall acts as a diffuse relay.
Coded Apertures, Wavefront Engineering: Various mask or aperture-coding schemes modulate the incoming light so that the captured data is easier to invert for tasks such as deblurring, depth extraction, or spectral imaging.
7. Applications in Modern Vision and Graphics
Computer Vision and 3D Reconstruction
Camera Calibration: One estimates the intrinsic matrix \(K\) and extrinsic parameters by observing known 3D points and their 2D projections. Many toolboxes implement this (e.g., OpenCV), typically solving for the projection matrix \(P = K[R|t]\).
Stereo / Multi-view: With multiple images from different viewpoints, one triangulates 3D points by intersecting rays (a minimal triangulation sketch appears after this list). Epipolar geometry (the fundamental or essential matrix) arises directly from the pinhole model.
Pose Estimation: Given correspondences between known 3D points and 2D detections, we solve the PnP (Perspective-n-Point) problem. This is central to augmented reality and robotics.
Structure from Motion (SfM): With multiple images of unknown camera poses, both camera parameters and 3D structure can be iteratively estimated by minimizing reprojection error (bundle adjustment).
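As a sketch of the multi-view idea, the code below synthesizes two views of a known 3D point and recovers it by linear (DLT) triangulation; the intrinsics, baseline, and point are illustrative assumptions, not values from the lecture.
import numpy as np
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
def projection_matrix(R, t):
    return K @ np.hstack([R, t.reshape(3, 1)])
def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation: each view contributes two rows of A X = 0.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                      # null vector of A (up to scale)
    return X_h[:3] / X_h[3]
# Two cameras: one at the origin, one offset along X (a stereo baseline).
P1 = projection_matrix(np.eye(3), np.zeros(3))
P2 = projection_matrix(np.eye(3), np.array([-0.5, 0.0, 0.0]))
X_true = np.array([0.3, -0.2, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
print(triangulate(P1, P2, x1, x2))    # recovers approximately [0.3, -0.2, 4.0]
With noisy correspondences, the same linear system is solved in a least-squares sense and then refined by minimizing reprojection error, as in bundle adjustment.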
Medical Imaging
X-ray and CT: X-ray radiographs are essentially projection images from a point source (like a pinhole). CT collects multiple such projections to invert the line integrals of X-ray attenuation, reconstructing cross-sectional volumes.
MRI, Ultrasound: Although their physics differ (spin resonance, acoustic echoes), each has a forward image formation model, inverted computationally to yield a final image.
Photorealistic Rendering in Computer Graphics
Rendering is the inverse of vision: given a 3D scene, we compute which rays reach each pixel. Methods like ray tracing sample rays from a virtual pinhole or thin lens through each pixel into the scene, recursively simulating bounces and lighting using the rendering equation:
\[L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i.\]
By accurately modeling geometry, materials, and camera parameters (lens aperture, focal length, etc.), we can recreate physically plausible images. Modern “physically based rendering” aims for realism by faithfully implementing optical laws.
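To connect the rendering equation to computation, here is a minimal Monte Carlo sketch estimating the outgoing radiance of a Lambertian surface under uniform incident radiance, a case where the integral has the closed-form value \(L_e + \rho L_i\); the albedo and radiance values are illustrative assumptions.
import numpy as np
rng = np.random.default_rng(0)
rho, L_e, L_i = 0.6, 0.0, 1.0     # assumed albedo, emitted and incident radiance
f_r = rho / np.pi                 # Lambertian BRDF (constant)
n = 100_000
# Uniform hemisphere sampling: cos(theta) is uniform on [0, 1]; the azimuth does
# not affect the integrand, so it is not sampled explicitly.
cos_theta = rng.uniform(0.0, 1.0, n)
pdf = 1.0 / (2.0 * np.pi)         # density of uniform hemisphere sampling
L_o = L_e + np.mean(f_r * L_i * cos_theta / pdf)
print(L_o)                        # ~0.6, matching L_e + rho * L_i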
8. Python Example: Simulating Image Formation
Below is a minimal snippet that demonstrates a pinhole camera pipeline:
import numpy as np
# Intrinsic parameters
f = 800 # focal length in pixels
cx, cy = 320, 240 # principal point (image center)
K = np.array([[f, 0, cx],
              [0, f, cy],
              [0, 0, 1]])
# Extrinsics: identity rotation & zero translation
R = np.eye(3)
t = np.zeros(3)
# Some 3D points (corners of a square at Z=5)
points_3d = np.array([
    [-1.0, -1.0, 5.0],
    [ 1.0, -1.0, 5.0],
    [ 1.0,  1.0, 5.0],
    [-1.0,  1.0, 5.0]
])
# Transform into camera coords
points_cam = (points_3d @ R.T) + t
# Project to image (homogeneous)
points_img_h = points_cam @ K.T
points_img = points_img_h[:, :2] / points_img_h[:, 2, None]
print(points_img)
Running this code yields 2D pixel coordinates of the projected 3D square. Try adjusting \(Z\) or \(f\) to see perspective effects.