3D Gaussian Splatting: A Basic Introduction

Introduction

3D Gaussian Splatting represents a breakthrough in novel view synthesis, offering real-time rendering of photorealistic scenes by combining the high quality of neural radiance fields with the efficiency of traditional rendering techniques. This approach has revolutionized the field by enabling photorealistic rendering at interactive frame rates while maintaining the ability to represent complex geometry and appearance. Unlike previous approaches to novel view synthesis that rely heavily on neural networks to implicitly encode scene information, 3D Gaussian Splatting uses an explicit representation with millions of 3D Gaussian primitives that can be efficiently rendered through a specialized pipeline. This representation strikes an effective balance between quality and performance, making it suitable for applications ranging from virtual reality to digital human creation.

This guide provides a comprehensive overview of 3D Gaussian Splatting, from foundational concepts to practical implementation. We'll explore the theoretical underpinnings, mathematical formulations, algorithmic details, and real-world applications of this powerful technique.

Foundations

What is a 3D Scene?

A 3D scene consists of objects, lights, and cameras arranged in three-dimensional space:

  • Geometry: The shapes and positions of objects in the scene

  • Material Properties: How surfaces interact with light (color, reflectance, transparency)

  • Lighting: The illumination of the scene from various light sources

  • Camera: The viewpoint from which the scene is observed

The goal of 3D scene reconstruction and rendering is to capture or create these elements and visualize them from arbitrary viewpoints.

3D Scene Representation

In computer graphics and vision, a 3D scene representation is a way to model the geometry and appearance of real or virtual environments. Common representations include:

  • Polygonal Meshes: Use vertices and faces to approximate surfaces; efficient for rendering on GPUs but can struggle with complex topology.

  • Voxel Grids: Discretize space into tiny cubes (like 3D pixels) storing color or density, enabling volumetric effects but often requiring high memory for fine detail.

  • Point Clouds: Represent scenes as unconnected points in space with attributes (color, normal, etc.), which are simple and flexible but typically produce sparse renderings with holes.

  • Neural Implicit Representations: Encode scenes in the weights of neural networks or continuous functions, yielding smooth interpolation but requiring inference for rendering.

3D Gaussian Splatting falls between point clouds and volumetric fields: it represents the scene as a cloud of 3D Gaussian density distributions (ellipsoids) that are continuous and overlapping, providing a smooth coverage of space without a fixed grid structure.

Computer Graphics Fundamentals

To understand Gaussian Splatting, we need to review some fundamental concepts in computer graphics:

  1. 3D Representation: Objects in 3D space can be represented in various ways:

     - Meshes: Collections of vertices, edges, and faces (typically triangles)
     - Point Clouds: Sets of points in 3D space
     - Volumetric Representations: Voxel grids or continuous functions defining density and color

  2. Camera Models: A camera defines how 3D points are projected onto a 2D image:

     - Intrinsics: Focal length, principal point, etc.
     - Extrinsics: Position and orientation of the camera in 3D space

  3. Rendering: The process of generating a 2D image from a 3D scene description

To appreciate Gaussian splatting, we must also understand coordinate transformations. In a typical rendering pipeline, a virtual camera with an intrinsic model (focal length, image plane) and extrinsic pose (position and orientation) captures the scene. Coordinates transform from world space (the scene’s coordinate system) to camera/view space, and finally to screen space via a projection (usually perspective projection).
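As a minimal illustration of this chain of transformations, the following sketch projects a world-space point through an assumed pinhole camera; the intrinsics (f, cx, cy), extrinsics (R, t), and function name are illustrative, not any particular library's API.

```python
import numpy as np

def world_to_image(p_world, R, t, f, cx, cy):
    """Project a 3D world-space point to pixel coordinates with a pinhole camera.

    R, t: extrinsics (world -> camera rotation and translation)
    f, cx, cy: intrinsics (focal length in pixels, principal point)
    """
    p_cam = R @ p_world + t          # world space -> camera/view space
    x, y, z = p_cam
    u = f * x / z + cx               # perspective divide + principal point offset
    v = f * y / z + cy
    return np.array([u, v]), z       # pixel coordinates and depth

# Example: a point 5 units in front of a camera sitting at the world origin
R, t = np.eye(3), np.zeros(3)
uv, depth = world_to_image(np.array([0.5, -0.2, 5.0]), R, t, f=1000.0, cx=640.0, cy=360.0)
print(uv, depth)                     # approx [740., 320.], depth 5.0
```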

Rasterization and Ray Tracing

Traditional rendering methods fall into two main categories:

Rasterization:

The process of converting vector graphics (like triangles) into a raster image (pixels on a screen). This involves:

  1. Transforming 3D vertices to screen space

  2. Determining which pixels are covered by each primitive

  3. Computing the color for each pixel based on material properties and lighting

Rasterization is efficient and forms the basis of real-time graphics in games and interactive applications.

Ray Tracing:

A technique that simulates the physical behavior of light by:

  1. Casting rays from the camera through each pixel

  2. Computing intersections with scene geometry

  3. Recursively following reflection, refraction, and shadow rays

Ray tracing produces more physically accurate images but has traditionally been more computationally expensive.

3D Gaussian Splatting uses a hybrid approach. It treats each 3D Gaussian as a primitive that can be rasterized by projecting it to an ellipse on the screen, but it then blends these contributions along each ray much as a volume ray tracer would. In effect, it performs a form of splatting rasterization in which continuous Gaussian blobs are drawn with transparency and combined in the correct depth order. This approach leverages the speed of rasterization (drawing many primitives quickly) while achieving the quality of ray-traced volume integration (alpha-blending accumulated contributions).

Alpha Blending and Compositing

When rendering transparent or semi-transparent objects, alpha blending is used to combine colors:

\[C_{final} = \alpha_{src} \cdot C_{src} + (1 - \alpha_{src}) \cdot C_{dst}\]

Where:

- \(C_{src}\) is the source color (new fragment)
- \(C_{dst}\) is the destination color (existing pixel)
- \(\alpha_{src}\) is the opacity of the source (0 = transparent, 1 = opaque)

This “over” operator is fundamental to compositing multiple transparent elements in the correct order. In the context of rendering a set of semi-transparent elements (like Gaussian splats or volume samples), the goal is to compute the pixel color as if light traveled through the elements, picking up color and attenuating as it goes.

The compositing equation for front-to-back rendering is:

\[C_{\text{out}} = \alpha_{0} C_{0} + \alpha_{1} C_{1}(1-\alpha_{0}) + \alpha_{2} C_{2}(1-\alpha_{0})(1-\alpha_{1}) + \dots\]

Here each layer (0, 1, 2, …) has a color C and opacity α, and layers are ordered from nearest to farthest from the viewer. The nearest layer contributes its color C₀ weighted by its opacity α₀. The next layer's contribution α₁C₁ is further attenuated by (1-α₀), meaning it only contributes in the portion not occluded by the first layer, and so on.

In a continuous form (for volume rendering), the color seen along a ray can be written as an integral of the attenuated radiance:

\[I = \int_{0}^{L} T(t)\, \sigma(t)\, c(t)\, dt, \quad \text{with} \quad T(t) = \exp\!\Big(-\int_{0}^{t} \sigma(s) ds\Big),\]

where \(\sigma(t)\) is the density (opacity per unit length) at distance \(t\) along the ray, \(c(t)\) is the color or radiance at that point, and \(T(t)\) is the accumulated transparency (transmittance) up to that point. This is the volumetric rendering equation used in radiance field methods like NeRF. Gaussian splatting adheres to the same image formation model as NeRF, meaning it blends contributions using an equivalent of this volume rendering integral but in a discretized manner by summing many Gaussian “particles” along the ray.

The Evolution of Novel View Synthesis

Novel view synthesis is the task of generating new views of a scene from limited input images. This field has evolved dramatically over the years.

Image-Based Rendering

Early approaches to novel view synthesis focused on interpolating between captured images:

  • Light Fields (Levoy & Hanrahan, 1996): Represented a scene as a 4D function of light rays

  • Lumigraphs (Gortler et al., 1996): Combined sparse geometric information with dense image sampling

  • Unstructured Lumigraph (Buehler et al., 2001): An image-based rendering method that handles sparse view input by leveraging a mesh proxy

These methods required dense sampling of viewpoints, limiting their practical applications. They stored many input views and interpolated between them.

Structure-from-Motion and Multi-View Stereo

To address the limitations of pure image-based methods, researchers developed techniques to recover 3D structure:

  • Structure-from-Motion (SfM): Recovers camera poses and a sparse 3D point cloud from multiple images. As capturing geometry became easier, approaches produced point clouds or meshes of the scene, enabling geometry-based view synthesis.

  • Multi-View Stereo (MVS): Densifies the sparse point cloud to create more complete 3D models

These classical computer vision approaches provided more geometric understanding but often struggled with complex materials and lighting effects.

Point-Based Rendering

Point-based rendering emerged as a simple way to render scenes captured as point clouds (e.g., from MVS or depth sensors) without mesh connectivity. However, naive point rendering shows gaps and aliasing. Research in the 2000s introduced splatting—rendering each point as a disk or Gaussian blob (also called surfel if it has a normal and texture) to cover holes and smooth the result.

Notable developments include:

  • Surfels (Pfister et al., 2000): Introduced as a rendering primitive for scanned surfaces

  • EWA Surface Splatting (Zwicker et al., 2001): Developed to properly filter and blend point contributions on screen

These laid the groundwork for representing scenes by clouds of particles instead of connected triangles.

Neural Rendering

The advent of deep learning spurred a new wave of novel view synthesis:

  • Neural Point-Based Graphics (NPBG) (Aliev et al., 2020): Combined point clouds with learned neural descriptors to improve fidelity

  • Neural Radiance Fields (NeRF) (Mildenhall et al., 2020): Showed that optimizing an MLP to represent volume density and color yields photorealistic novel views

NeRF marked a paradigm shift by representing a scene as a continuous function modeled by a neural network:

\[F_\Theta(x, d) \mapsto (c, \sigma)\]

Where:

- \(x \in \mathbb{R}^3\) is a 3D position
- \(d \in \mathbb{R}^2\) is a viewing direction
- \(c \in \mathbb{R}^3\) is the emitted color (RGB)
- \(\sigma \in \mathbb{R}\) is the volume density

NeRF renders images using volumetric integration along camera rays, achieving unprecedented quality for novel view synthesis. However, it suffers from slow rendering times due to the need to evaluate the neural network for many points along each ray.

Accelerated Neural Fields

NeRF sparked a revolution: numerous variants improved quality and speed over the next few years:

  • Mip-NeRF and Mip-NeRF 360 (Barron et al., 2021, 2022): Addressed aliasing and unbounded scenes

  • Plenoxels (Fridovich-Keil et al., 2022): Removed the neural network, optimizing explicit grids of voxels with densities and colors

  • Instant NGP (Müller et al., 2022): Used a multiresolution hash grid encoding to drastically speed up training and rendering

Grid-based variants such as Plenoxels and Instant NGP achieve fast training but still face memory-resolution trade-offs and require interpolation during ray marching.

3D Gaussian Splatting: A Convergence of Approaches

3D Gaussian Splatting (Kerbl et al., 2023) can be seen as a convergence of these directions. It brings back an explicit point-based representation (like the classic surfel idea) but imbues it with volumetric rendering principles (like NeRF’s alpha blending). Crucially, it optimizes the scene in situ by gradient descent, similar to neural fields, making it a differentiable point-based representation.

The result is a method that can be trained rapidly (minutes) like the fastest radiance field methods, yet renders in real-time with quality comparable to slower neural networks. This bridges the gap between neural and point-based novel view synthesis: millions of tiny Gaussian primitives optimized to reproduce input images, achieving state-of-the-art novel view synthesis results with unprecedented speed.

Point-Based Rendering

Point Clouds and Their Challenges

Point clouds provide a simple and flexible representation of 3D geometry:

  • Each point has a position in 3D space and optional attributes (color, normal, etc.)

  • No explicit connectivity information (unlike meshes)

  • Can be directly acquired from sensors like LiDAR or derived from MVS

However, rendering raw point clouds presents challenges:

- Gaps between points can result in holes or artifacts
- Point size is not inherently defined
- Handling occlusion requires careful ordering

The Concept of Splatting

Splatting addresses these challenges by representing each point as a small surface element or “splat”:

  1. Project each 3D point onto the image plane

  2. Render each point as a small 2D footprint (disk or ellipse)

  3. Blend these footprints together to form a continuous surface

The size and shape of each splat can be adjusted based on local density, viewing angle, and surface properties.

Elliptical Weighted Average (EWA) Filtering

EWA filtering, introduced by Zwicker et al. (2001), improves the quality of point-based rendering:

  1. Represent each point as a 3D Gaussian ellipsoid

  2. Project this Gaussian to the image plane, resulting in a 2D Gaussian ellipse

  3. Apply filtering to prevent aliasing

  4. Blend the filtered footprints to create the final image

Concretely, in EWA splatting:

- A 3D Gaussian kernel in object space is projected to screen space
- The projected kernel is combined with a low-pass filter to prevent aliasing
- This combined filter function (the EWA filter) determines each splat's contribution
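One common way to realize the combined filter is to convolve the projected Gaussian with a small isotropic screen-space Gaussian, which simply adds to the 2D covariance. The sketch below assumes a filter variance of 0.3 px², a typical but not universal choice.

```python
import numpy as np

def ewa_filtered_covariance(cov2d, filter_var=0.3):
    """Approximate the EWA screen-space filter by convolving the projected
    2D Gaussian with an isotropic pixel filter: covariances of convolved
    Gaussians add, so we simply dilate the diagonal."""
    return cov2d + filter_var * np.eye(2)

cov2d = np.array([[2.0, 0.4],
                  [0.4, 0.5]])          # projected splat covariance (pixels^2)
print(ewa_filtered_covariance(cov2d))   # slightly larger ellipse, which suppresses sub-pixel aliasing
```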

Differentiable Point-Based Rendering

Recent advances have made point-based rendering differentiable, enabling optimization of point attributes through backpropagation:

  • Differentiable Surface Splatting (Wang et al., 2019): Made the splatting pipeline differentiable, allowing gradients to flow back to point positions and attributes

  • Neural Point-Based Graphics (Aliev et al., 2020): Combined traditional point rendering with neural networks

  • ADOP (Rückert et al., 2022): Provided an approximate differentiable one-pixel point rendering approach

  • Pulsar (Lassner & Zollhöfer, 2021): Created an efficient sphere-based neural rendering method

These methods lay the groundwork for learning-based approaches like 3D Gaussian Splatting by providing differentiable rendering that can compute gradients with respect to point attributes.

3D Gaussian Splatting: Core Principles

3D Gaussian Splatting builds upon both volumetric rendering concepts from NeRF and point-based rendering techniques.

Key Insight: Unifying Points and Volumes

The fundamental insight of 3D Gaussian Splatting is that:

  1. Both volumetric rendering (as in NeRF) and alpha blending of point splats follow the same mathematical model

  2. By representing a scene as a collection of 3D Gaussians instead of a neural network, rendering can be performed efficiently

  3. The Gaussian representation is differentiable, allowing optimization through gradient descent

This unified view enables high-quality rendering with real-time performance.

3D Gaussians as Scene Primitives

In 3D Gaussian Splatting, each scene element is represented as an anisotropic 3D Gaussian:

\[G(x) = \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)\]

Where:

- \(\mu \in \mathbb{R}^3\) is the center (mean position)
- \(\Sigma \in \mathbb{R}^{3 \times 3}\) is the covariance matrix, defining the Gaussian's shape and orientation
- For an isotropic Gaussian, \(\Sigma = \sigma^2 I\) (a sphere)
- For an anisotropic Gaussian, \(\Sigma\) can represent any ellipsoid

Each Gaussian also carries:

- An opacity parameter controlling its contribution to the final image
- Color information, either as a constant RGB value or view-dependent (via spherical harmonics)

A Gaussian in 3D is a function defined by a mean (center position) and a covariance matrix (which describes its spatial extent in each direction). The covariance matrix \(\Sigma\) is symmetric positive semi-definite. We can interpret \(\Sigma\) as defining an ellipsoid surface where \((\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) = 1\). The eigenvalues of \(\Sigma\) give the squared lengths of the ellipsoid's principal semi-axes, and the eigenvectors give the orientation of those axes.
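As a small worked example of these definitions, the following sketch evaluates G(x) and recovers the ellipsoid's semi-axes from the eigendecomposition of Σ (variable names are illustrative).

```python
import numpy as np

def gaussian_3d(x, mu, cov):
    """Unnormalized 3D Gaussian: G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def ellipsoid_axes(cov):
    """Principal axes of the 1-sigma ellipsoid: eigenvectors give directions,
    square roots of the eigenvalues give the semi-axis lengths."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return np.sqrt(eigvals), eigvecs

mu = np.zeros(3)
cov = np.diag([0.09, 0.01, 0.01])                        # an elongated (anisotropic) Gaussian
print(gaussian_3d(np.array([0.3, 0.0, 0.0]), mu, cov))   # exp(-0.5) ~ 0.607
print(ellipsoid_axes(cov)[0])                            # semi-axes [0.1, 0.1, 0.3] (ascending eigenvalues)
```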

The Volumetric Rendering Equation

The volumetric rendering equation computes the color of a pixel by integrating contributions along a camera ray:

\[C(r) = \int_{t_{min}}^{t_{max}} T(t) \cdot \sigma(r(t)) \cdot c(r(t), d) \, dt\]

Where:

- \(r(t) = o + t \cdot d\) is the position along the ray
- \(T(t) = \exp\left(-\int_{t_{min}}^{t} \sigma(r(s)) \, ds\right)\) is the transmittance
- \(\sigma(r(t))\) is the density at position \(r(t)\)
- \(c(r(t), d)\) is the color at position \(r(t)\) viewed from direction \(d\)

In practice, this integral is approximated as a discrete sum:

\[C \approx \sum_{i=1}^{N} T_i \cdot (1 - e^{-\sigma_i \delta_i}) \cdot c_i\]

Where:

- \(T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)\) with \(\alpha_j = 1 - e^{-\sigma_j \delta_j}\)
- \(\delta_i\) is the distance between adjacent samples

An important insight is that Gaussian splatting uses the same optical model as NeRF and other radiance fields. That is, it assumes the color seen by the camera is the integral of emitted radiance times transparency. No shadowing or global illumination is considered; each Gaussian contributes independently to the rays that intersect it.
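A direct transcription of this discrete sum, with illustrative sample values, might look like the following sketch.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete volume rendering: alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i} (1 - alpha_j), C = sum_i T_i * alpha_i * c_i."""
    C = np.zeros(3)
    T = 1.0                                   # transmittance accumulated so far
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - np.exp(-sigma * delta)
        C += T * alpha * np.asarray(c)
        T *= (1.0 - alpha)
    return C, T                               # final color and residual transmittance

# Two samples along a ray: a faint red region followed by a dense blue one
C, T = composite_ray(sigmas=[0.5, 5.0],
                     colors=[(1, 0, 0), (0, 0, 1)],
                     deltas=[0.1, 0.1])
print(C, T)
```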

Alpha Compositing with Gaussians

In 3D Gaussian Splatting, the same rendering equation is implemented using alpha compositing of Gaussian splats:

\[C = \sum_{i=1}^{N} \left(\prod_{j=1}^{i-1}(1-\alpha_j)\right) \alpha_i c_i\]

Where:

- Gaussians are sorted front-to-back
- \(\alpha_i\) is derived from the Gaussian's opacity and its projected 2D footprint
- \(c_i\) is the Gaussian's color (potentially view-dependent)
- \(\prod_{j=1}^{i-1}(1-\alpha_j)\) accounts for occlusion by closer Gaussians

Alpha compositing with many Gaussians follows the same rules as described in the alpha blending section. However, one challenge is that there may be millions of Gaussians, and they are not ordered in any simple way like layers. For proper rendering, we need to:

  1. Project all Gaussians to 2D (obtaining their elliptical footprint and an effective per-pixel alpha mask)

  2. Sort all these “splats” by depth (distance from camera)

  3. Blend them back-to-front (furthest first) or front-to-back into the image

For front-to-back compositing (which is numerically stable and easy to implement incrementally):

\[\begin{split}C_{\text{accum}} \leftarrow C_{\text{accum}} + (1 - \alpha_{\text{accum}})\, \alpha_i\, C_i, \\ \alpha_{\text{accum}} \leftarrow \alpha_{\text{accum}} + (1 - \alpha_{\text{accum}})\, \alpha_i,\end{split}\]

where \(C_{\text{accum}}\) and \(\alpha_{\text{accum}}\) are the running accumulated color and opacity, and \(C_i,\alpha_i\) are the next splat’s color and opacity.
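A per-pixel sketch of this running accumulation, including the early termination a real renderer can apply once the pixel saturates, could look like this (values and the cutoff are illustrative).

```python
def blend_front_to_back(splats, alpha_cutoff=0.999):
    """Composite per-pixel splat contributions sorted front-to-back.
    `splats` is a list of (alpha_i, color_i) pairs, nearest splat first."""
    c_accum = [0.0, 0.0, 0.0]
    a_accum = 0.0
    for alpha, color in splats:
        w = (1.0 - a_accum) * alpha               # remaining transmittance times opacity
        c_accum = [c + w * ci for c, ci in zip(c_accum, color)]
        a_accum += w
        if a_accum > alpha_cutoff:                # early termination: pixel is effectively opaque
            break
    return c_accum, a_accum

print(blend_front_to_back([(0.6, (1, 0, 0)), (0.8, (0, 1, 0)), (0.5, (0, 0, 1))]))
```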

Mathematical Formulation of 3D Gaussian Splatting

Let’s delve into the mathematical details of 3D Gaussian Splatting.

Projecting 3D Gaussians to 2D

When rendering, each 3D Gaussian is projected to a 2D elliptical splat on the image plane:

  1. The center \(\mu\) is projected to the image using the camera projection matrix

  2. The 3D covariance \(\Sigma\) is transformed to a 2D covariance \(\Sigma_{img}\) in image space:

    \[\Sigma_{img} = J \cdot W \cdot \Sigma \cdot W^T \cdot J^T\]

    Where:

    - \(W\) is the view transformation matrix
    - \(J\) is the Jacobian of the projection at the Gaussian's center

  3. The resulting 2D Gaussian defines the shape and extent of the splat on the image

Projecting a 3D Gaussian onto the image plane yields a 2D Gaussian (an ellipse) in screen space. However, because the camera uses a perspective projection (not a linear mapping), projecting a Gaussian exactly is non-trivial. A common approach is to approximate the perspective transform locally as an affine transform around the Gaussian’s center. This yields a linear mapping of the covariance.

Let’s derive the projection formula for a 3D Gaussian’s covariance to the screen (image) plane in a simplified scenario. Assume we have a pinhole camera. Let a 3D Gaussian have world covariance \(\Sigma\). We want the covariance of the image projection.

Step 1: Transform to camera coordinates. Apply the rigid transformation (rotation \(R_c\) and translation \(t_c\) of the camera pose) to the Gaussian. In linear approximation, the covariance in camera frame (before projection) is \(\Sigma_c = R_c \Sigma R_c^T\) (we ignore translation for covariance). This aligns the Gaussian with the camera’s view axes.

Step 2: Project to image. For an orthographic camera (no perspective foreshortening), projection is just dropping the z-coordinate, and indeed the covariance on the image would be the top-left 2x2 block of \(\Sigma_c\). For a perspective camera, we consider an object point at depth \(Z\) projects with scale \(f/Z\) (where \(f\) is focal length). If the Gaussian has a small extent, we linearize around depth \(Z_0\) (the depth of the Gaussian center).

Step 3: Simplify for usage. The final result used is:

\[\Sigma' = J W\, \Sigma\, W^T J^T,\]

with \(\Sigma'\) being a \(3\times3\) matrix in camera coords, and then using the upper-left \(2\times2\) block as the image covariance.

The key take-away is that an anisotropic 3D Gaussian remains anisotropic in the image – usually an ellipse. The formula provides a way to compute that ellipse’s shape from the 3D parameters. We can thus rasterize each Gaussian as an oriented ellipse with appropriate size.
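The covariance projection can be sketched directly from the formula above; the Jacobian below is that of the pinhole mapping (u, v) = (f·x/z, f·y/z), linearized at the Gaussian's center in camera space. Names are illustrative.

```python
import numpy as np

def project_covariance(cov_world, R_view, mean_cam, f):
    """Project a 3D covariance to a 2D screen-space covariance using the
    local affine (Jacobian) approximation: Sigma' = J W Sigma W^T J^T.

    R_view: world -> camera rotation (the linear part W of the view transform)
    mean_cam: Gaussian center in camera coordinates (point at which J is evaluated)
    f: focal length in pixels
    """
    x, y, z = mean_cam
    # Jacobian of (u, v) = (f x / z, f y / z); the third row is irrelevant and kept zero.
    J = np.array([[f / z, 0.0, -f * x / z**2],
                  [0.0, f / z, -f * y / z**2],
                  [0.0, 0.0,   0.0]])
    cov_cam = R_view @ cov_world @ R_view.T
    cov_full = J @ cov_cam @ J.T
    return cov_full[:2, :2]                      # upper-left 2x2 block = image covariance

cov_world = np.diag([0.04, 0.01, 0.01])
print(project_covariance(cov_world, np.eye(3), mean_cam=np.array([0.0, 0.0, 4.0]), f=1000.0))
```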

Parameterization of 3D Gaussians

To ensure that the Gaussian representation remains valid during optimization, the covariance matrix \(\Sigma\) is parameterized as:

\[\Sigma = R \cdot S \cdot S^T \cdot R^T\]

Where:

- \(R\) is a rotation matrix (often represented as a quaternion)
- \(S\) is a diagonal scaling matrix with entries \(s_x, s_y, s_z\)

This decomposition guarantees that \(\Sigma\) remains positive semidefinite throughout optimization.

Directly optimizing the \(3\times3\) covariance matrices \(\Sigma\) for potentially millions of Gaussians is tricky, because we must ensure each \(\Sigma\) stays valid (symmetric positive semi-definite). Gradient descent could easily nudge some matrix to become non-PSD, causing the Gaussian to become invalid (e.g., negative variance in some direction).

This parameterization ensures that any combination of parameters yields a valid covariance (since \(R\) is orthonormal and \(S S^T\) is positive semi-definite). During optimization, the quaternion representing \(R\) is normalized after each gradient update so that it remains a valid rotation.
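A small sketch of this parameterization, building Σ from a quaternion and per-axis scales, is shown below.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion q = (w, x, y, z), normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_params(quat, scales):
    """Sigma = R S S^T R^T, guaranteed symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

cov = covariance_from_params(quat=[1.0, 0.0, 0.0, 0.3], scales=[0.2, 0.05, 0.05])
print(np.linalg.eigvalsh(cov))   # eigenvalues are the squared scales, all >= 0
```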

View-Dependent Appearance

To model view-dependent effects like specular highlights, each Gaussian can use spherical harmonics to encode color as a function of viewing direction:

\[c(d) = \sum_{l=0}^{L} \sum_{m=-l}^{l} c_{lm} Y_{lm}(d)\]

Where:

- \(Y_{lm}(d)\) are the spherical harmonic basis functions
- \(c_{lm}\) are the coefficients (up to 2nd order this is 9 coefficients per color channel; the original implementation uses up to 3rd order, i.e., 16 coefficients per channel)

This allows each Gaussian to exhibit different colors when viewed from different angles, similar to NeRF’s view-dependent modeling.
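A sketch of evaluating such a view-dependent color for bands l ≤ 1 follows. The constants are the standard real spherical-harmonic normalization factors; the sign convention and the 0.5 offset/clamp mirror common implementations and are assumptions here.

```python
import numpy as np

SH_C0 = 0.28209479177387814          # |Y_0^0|
SH_C1 = 0.4886025119029199           # |Y_1^m|

def sh_color(coeffs, d):
    """Evaluate view-dependent color from SH coefficients.

    coeffs: (4, 3) array, one RGB coefficient per basis function (l <= 1).
    d: viewing direction (x, y, z), normalized inside.
    """
    x, y, z = d / np.linalg.norm(d)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])
    rgb = basis @ coeffs                  # linear combination per channel
    return np.clip(rgb + 0.5, 0.0, 1.0)   # shift/clamp as some implementations do

coeffs = np.zeros((4, 3))
coeffs[0] = [0.8, 0.2, 0.2]               # mostly view-independent red
coeffs[3] = [0.0, 0.0, 0.3]               # a little blue that varies with the x direction
print(sh_color(coeffs, d=np.array([1.0, 0.0, 0.0])))
```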

Differentiable Rendering Equations

A major strength of 3D Gaussian Splatting is that the rendering process is made differentiable so that gradient-based optimization can be used to adjust the Gaussians to fit input images. Differentiating the rendering with respect to Gaussian parameters means computing partial derivatives of the final pixel colors with respect to each Gaussian’s position, covariance, color, and opacity.

The rendering function can be denoted as \(I = R(\{p_i, \Sigma_i, c_i, \alpha_i\})\) producing an image \(I\). We have a loss \(L = \sum_{\text{pixels}} \|I - I_{\text{target}}\|^2\) for example. The gradient \(\frac{\partial L}{\partial p_i}\) indicates how moving Gaussian \(i\) in space would affect the loss, and similarly for \(\Sigma_i\), etc.

In simpler terms, because each Gaussian's influence on the image is smooth and continuous (thanks to the Gaussian function), the rendering is differentiable w.r.t. the Gaussian parameters. The gradients tell us how to adjust each Gaussian to reduce the reconstruction error.

The differentiable rendering equations for Gaussian splats ensure that one can start from an initial guess (say a sparse set of Gaussians) and converge to an accurate model of the scene by iterative optimization (gradient descent).

Training and Optimization

The 3D Gaussian Splatting pipeline is trained end-to-end by optimizing the parameters of all Gaussians to match a set of input photographs.

Photometric Loss

The training uses a photometric loss that compares rendered images to ground truth photos:

\[L(\theta) = (1 - \lambda) \cdot L_1 + \lambda \cdot L_{D-SSIM}\]

Where:

- \(L_1 = \|I_{render} - I_{gt}\|_1\) (L1 distance between rendered and ground truth images)
- \(L_{D-SSIM} = 1 - SSIM(I_{render}, I_{gt})\) (structural similarity term)
- \(\lambda\) is a weighting factor (typically 0.2)
- \(\theta\) represents all Gaussian parameters
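A minimal PyTorch sketch of this objective is shown below; `ssim_fn` stands in for any differentiable SSIM implementation (for example one from the pytorch_msssim package), and the tensor layout is an assumption.

```python
import torch

def photometric_loss(rendered, target, ssim_fn, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * (1 - SSIM), as in the objective above.

    rendered, target: (B, 3, H, W) tensors in [0, 1].
    ssim_fn: differentiable SSIM returning a scalar similarity in [0, 1].
    """
    l1 = torch.abs(rendered - target).mean()
    d_ssim = 1.0 - ssim_fn(rendered, target)
    return (1.0 - lam) * l1 + lam * d_ssim
```

With `lam=0.2` this matches the weighting quoted above.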

Initial Point Cloud

The training typically starts from a sparse point cloud generated by Structure-from-Motion (SfM):

  1. Each point in the SfM reconstruction becomes an initial 3D Gaussian

  2. Through optimization, these Gaussians are refined, and new ones are added as needed

  3. The final model might contain millions of optimized Gaussians

This approach enables reconstruction from a minimal set of images without requiring dense MVS reconstruction or additional inputs.

Optimization Process

The optimization process for 3D Gaussian Splatting involves adjusting millions of parameters (positions \(p_i\), opacities \(\alpha_i\), covariance \(\Sigma_i\), and colors \(c_i\) of each Gaussian) so that rendered images match the input photographs. This is done by minimizing a reconstruction loss between the rendered image and ground truth, for each training view, summed over all views.

Iterative Optimization: The method uses gradient descent (a variant such as the Adam optimizer) to update the Gaussian parameters. Because each Gaussian's effect on each pixel is differentiable, gradients are accumulated by rendering the current estimate from each input view, comparing it to the actual photograph, and backpropagating the error.

At each gradient step, for each Gaussian, the optimizer computes how the image error changes if that Gaussian's parameters change – this is done efficiently using derived analytic gradients. For example, the gradient w.r.t. position considers the image gradient (difference) at the splat's location, essentially pushing the Gaussian toward where the image says it should be.

The optimization proceeds with iterations of: render → compute loss → compute gradients → update Gaussians. This continues until the error converges. During this process, they also perform adaptive density control, which adds or removes Gaussians to improve the fit.
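A schematic of this loop in PyTorch style is shown below; `render`, `ssim_fn`, and the parameter names are placeholders standing in for a real differentiable rasterizer and loss, not the reference implementation.

```python
import torch

def train(gaussians, views, render, ssim_fn, iters=30_000, lam=0.2):
    """Schematic optimization loop: render -> loss -> backprop -> update.

    gaussians: dict of learnable tensors (means, log_scales, quats, sh, opacity_logits)
    views: list of (camera, target_image) pairs
    render: differentiable rasterizer, render(gaussians, camera) -> image tensor
    """
    params = [p for p in gaussians.values() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)

    for it in range(iters):
        camera, target = views[it % len(views)]        # cycle over / sample training views
        image = render(gaussians, camera)              # forward: differentiable splatting
        l1 = torch.abs(image - target).mean()
        loss = (1 - lam) * l1 + lam * (1 - ssim_fn(image, target))
        opt.zero_grad()
        loss.backward()                                # gradients w.r.t. all Gaussian parameters
        opt.step()
        # every few hundred iterations: densify / prune (see "Adaptive Density Control")
    return gaussians
```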

Differentiable Splatting Pipeline

The entire rendering pipeline is differentiable, allowing gradients to be computed with respect to all Gaussian parameters:

  1. Position gradients: How changing a Gaussian’s position affects the image

  2. Covariance gradients: How changing a Gaussian’s shape affects the image

  3. Opacity gradients: How changing a Gaussian’s opacity affects the image

  4. Color gradients: How changing a Gaussian’s color affects the image

These gradients are used to update the parameters via gradient descent (typically Adam optimizer).

Adaptive Density Control

During training, the set of Gaussians is not fixed: the optimizer periodically adjusts how many Gaussians exist and where they are, a process called adaptive density control (a sketch appears at the end of this section):

  1. Densification: Adding Gaussians where needed

     - Splitting: Large Gaussians with high error are split into smaller ones
     - Cloning: Small Gaussians with high error are cloned (duplicated and slightly offset)

  2. Pruning: Removing Gaussians with negligible contribution (very low opacity)

This procedure allows the representation to adapt to the scene's complexity, allocating more Gaussians to detailed regions and fewer to simple areas. Notably, unlike some other methods that need special handling for far-away geometry (e.g., Mip-NeRF 360 warps unbounded space, and grid-based methods need sparse structures to cover distant regions), Gaussian Splatting's Gaussians remain in Euclidean space with no additional warp. The adaptive control naturally covers space with as many Gaussians as needed; distant parts of the scene simply receive more Gaussians if required, rather than being compressed.

The overall optimization algorithm therefore alternates between:

  1. Parameter Update: Standard gradient descent to update existing Gaussian parameters

  2. Density Control: Periodically adjusting the number and distribution of Gaussians
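The sketch below illustrates one way such a densify/prune step could look; the thresholds, field names, and split heuristic are illustrative assumptions rather than the reference implementation's exact logic.

```python
import torch

def density_control(g, grad_threshold=2e-4, opacity_threshold=0.005, scale_threshold=0.01):
    """Illustrative densify/prune step on a dict of per-Gaussian tensors.

    g: {"means": (N,3), "scales": (N,3), "opacity": (N,), "grad_accum": (N,)}
    grad_accum holds accumulated positional-gradient magnitudes (a proxy for error).
    """
    big = g["scales"].max(dim=1).values > scale_threshold
    needs_more = g["grad_accum"] > grad_threshold       # high gradient -> region under-reconstructed

    clone_mask = needs_more & ~big                      # small + high error -> clone
    split_mask = needs_more & big                       # large + high error -> split

    clones = {k: v[clone_mask].clone() for k, v in g.items()}
    splits = {k: v[split_mask].clone() for k, v in g.items()}
    splits["scales"] = splits["scales"] / 1.6           # children are smaller than the parent
    splits["means"] = splits["means"] + 0.01 * torch.randn_like(splits["means"])  # slight offset

    keep = g["opacity"] > opacity_threshold             # prune nearly transparent Gaussians
    return {k: torch.cat([v[keep], clones[k], splits[k]]) for k, v in g.items()}
```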

Implementation and Real-Time Rendering

Efficient implementation is crucial for achieving real-time performance with 3D Gaussian Splatting.

Tile-Based Rendering

To achieve real-time performance, 3D Gaussian Splatting uses a tile-based approach:

  1. Divide the screen into tiles (e.g., 16×16 pixels)

  2. For each tile:

     - Determine which Gaussians overlap it
     - Sort those Gaussians by depth
     - Render them front-to-back

This localizes sorting to small batches of Gaussians, dramatically reducing computational overhead. The naive approach of considering every Gaussian for every pixel would be far too slow. Tile-based rendering exploits spatial coherence in the scene and image.
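A minimal sketch of the binning step (assigning each projected splat to every tile its screen-space bounding box overlaps) might look like this; the names and the bounding-circle approximation are assumptions for illustration.

```python
import numpy as np

def bin_splats_into_tiles(centers_px, radii_px, width, height, tile=16):
    """Assign each projected splat to the 16x16 tiles its bounding box overlaps.

    centers_px: (N, 2) screen-space centers; radii_px: (N,) conservative pixel radii.
    Returns a dict mapping tile index (tx, ty) -> list of splat indices.
    """
    tiles = {}
    n_tx = (width + tile - 1) // tile
    n_ty = (height + tile - 1) // tile
    for i, ((u, v), r) in enumerate(zip(centers_px, radii_px)):
        tx0 = max(int((u - r) // tile), 0)
        tx1 = min(int((u + r) // tile), n_tx - 1)
        ty0 = max(int((v - r) // tile), 0)
        ty1 = min(int((v + r) // tile), n_ty - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                tiles.setdefault((tx, ty), []).append(i)
    return tiles

tiles = bin_splats_into_tiles(np.array([[100.0, 50.0], [300.5, 40.0]]),
                              np.array([10.0, 3.0]), width=640, height=360)
print(dict(sorted(tiles.items())))
```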

Fast Sorting Strategies

Sorting is required because correct alpha blending needs a defined order (near-to-far or far-to-near). The implementation uses highly optimized GPU sorting algorithms (e.g., a parallel radix sort in CUDA). Once Gaussians are culled per tile, each tile's list is sorted by depth, using the depth of the Gaussian's center; this is adequate as long as Gaussians are small relative to the depth variation along the ray.

Sorting thousands of items on GPU is quite fast (modern GPUs can sort millions of numbers in a few milliseconds). And since this can be done tile-parallel (each tile sorting in parallel), the overall complexity is manageable.

The renderer is visibility-aware, meaning it accounts for occlusion by sorting. It also avoids “hard limits” on number of splats that get gradients, which implies the sorting + compositing method can handle an arbitrary number of overlapping Gaussians without capping (unlike some prior methods).

After sorting per tile, the renderer rasterizes the Gaussian splats into that tile's pixels in depth order. Rasterization here means drawing the 2D Gaussian (ellipse) footprint. This can be implemented either by drawing a textured sprite (point splat) or by drawing a screen-aligned quad and evaluating the Gaussian equation in the fragment shader to obtain coverage and alpha.

To ensure efficiency, the reference implementation uses custom CUDA kernels rather than issuing one draw call per Gaussian (millions of draw calls would be far too slow). Each tile is processed by one GPU thread block, which loads batches of that tile's sorted Gaussians into shared memory and blends them into the tile's pixels.
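The sorting itself is typically driven by a key that combines tile ID and view-space depth, so that a single radix sort groups splats per tile and orders them by depth within each tile. The sketch below illustrates that idea in NumPy; the exact packing and bit widths are assumptions, and depths are assumed positive (in front of the camera).

```python
import numpy as np

def make_sort_keys(tile_ids, depths):
    """Pack (tile id, depth) into 64-bit keys: tile index in the high 32 bits,
    the IEEE-754 bits of the positive depth in the low 32 bits. For positive
    floats the bit pattern is monotone in value, so one sort of the keys groups
    splats by tile and orders them near-to-far within each tile."""
    depth_bits = np.asarray(depths, dtype=np.float32).view(np.uint32).astype(np.uint64)
    return (tile_ids.astype(np.uint64) << np.uint64(32)) | depth_bits

tile_ids = np.array([3, 1, 3, 1], dtype=np.uint64)
depths = np.array([2.5, 7.0, 1.0, 4.0], dtype=np.float32)
order = np.argsort(make_sort_keys(tile_ids, depths))
print(order)   # splats grouped per tile, front-to-back within each tile: [3, 1, 2, 0]
```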

GPU-Accelerated Rasterization

All the rendering steps are implemented on GPU to achieve real-time performance:

  • Memory for Gaussians: Storing up to 5 million Gaussians, each with position (3 floats), rotation (quaternion, 4 floats), scale (3 floats), color (e.g., 9 SH coefficients × 3 channels = 27 floats), opacity (1 float) can add up. For example, if each Gaussian has ~40 floats, 5 million Gaussians = 200 million floats, which is 800 MB (if 4 bytes each). They may compress some (for instance, color SH could be 1st or 2nd order only, or stored in half precision).

  • Spatial data structure: The tile culling is essentially a spatial index. Each frame, Gaussians are culled against the view frustum and assigned to the screen tiles they overlap; because Gaussians move during training, these tile assignments are recomputed efficiently every iteration rather than stored in a static structure.

  • Parallel rasterization: Modern GPUs support drawing points with programmable size, but anisotropic Gaussians (rotated ellipses) may require a custom approach: either rendering each Gaussian as a small triangle mesh (like an oriented billboard quad) or using a compute kernel.

The rasterizer has constant overhead per pixel and low memory consumption, which suggests it does not allocate large per-pixel arrays (some earlier methods kept per-pixel lists of splats). Blending is instead done on the fly during the per-tile traversal, so memory usage is essentially the output image plus small per-tile working buffers.

The implementation also supports anisotropic splats and fast back-propagation through the renderer. For the backward pass, only minimal information is stored: the per-tile sorted lists are traversed again (back-to-front) and the intermediate opacities needed for gradients are recomputed rather than recorded per pixel.

Memory use is mainly:

- The Gaussian list (which can be hundreds of MB)
- The image buffer (e.g., 1920×1080 RGBA)
- Intermediate buffers for tile indices, sorting keys, etc.

Memory Considerations

The memory footprint of 3D Gaussian Splatting depends on:

  • Number of Gaussians (typically 1-5 million for a complex scene)

  • Parameters per Gaussian (position, rotation, scale, color, opacity)

A typical scene might require hundreds of megabytes of memory, which is manageable on modern GPUs. The approach claims real-time rendering (≥ 30 fps at 1080p) for the final scenes, which is remarkable given the complexity.

Speed optimizations include:

- Tile culling to reduce fragment-shading work
- Parallel sorting
- Controlled blending
- Careful use of shared memory and warp-synchronous programming in CUDA to handle each tile's compositing

The method also leverages this rendering speed during training: each iteration renders a full image from a randomly selected training viewpoint on the fly to compute the loss, which keeps the optimization itself fast as well.

Comparison with Other Methods

NeRF vs. 3D Gaussian Splatting

Neural Radiance Fields (NeRF) and 3D Gaussian Splatting both ultimately represent scenes as a volumetric field of color and density, but their implementations differ drastically:

  • Representation: NeRF encodes the scene in the weights of a neural network (usually an MLP) which, given a 3D coordinate (and viewing direction), outputs color and density. This is a continuous implicit representation. Gaussian Splatting, on the other hand, uses an explicit list of primitives (Gaussians) with explicit parameters stored for each.

  • Rendering: NeRF uses ray marching – sampling many points along each ray and querying the network for density and color, then compositing. This can produce high-quality results but is slow because of the many network evaluations. Gaussian Splatting uses rasterization of primitives – no heavy per-sample computation, just blending of already stored colors. This is orders of magnitude faster at render time.

  • Training Speed: Vanilla NeRF is slow to train (hours to days) due to optimizing many network weights and repeatedly sampling all rays. Many improvements (Instant NGP, Plenoxels) sped this up. Gaussian Splatting trains very fast (minutes for a scene) because it directly optimizes explicit per-primitive parameters (millions of Gaussian parameters, a similar order of magnitude to a NeRF's weights) and uses an efficient renderer in the optimization loop.

  • Quality and Generalization: NeRF with enough capacity can often achieve slightly higher peak quality for a given scene. However, recent results show Gaussian Splatting achieving comparable or even superior quality to state-of-the-art NeRF variants for captured scenes.

| Aspect | Neural Radiance Fields | 3D Gaussian Splatting |
|---|---|---|
| Representation | Implicit (neural network) | Explicit (set of 3D Gaussians) |
| Training Time | Hours to days | Minutes |
| Rendering Speed | Seconds per frame | 30-135 FPS (real-time) |
| Memory Usage | Low (network weights) | Moderate (millions of Gaussians) |
| Quality | High | High |
| Editability | Limited (implicit) | High (explicit primitives) |

Voxel-Based Representations vs. Gaussians

Voxel-based radiance fields (like Plenoxels, DVGO, or Neural Volumes) discretize space into a 3D grid. Each cell stores density and possibly a view-dependent color. Rendering is then a ray-marching through the grid, trilinearly interpolating values.

  • Memory: Voxel grids can be very memory heavy if high resolution is needed everywhere. Sparse voxel techniques allocate more voxels where needed and prune empty space, but for large scenes, memory can explode or require multi-level grids. Gaussian Splatting is more memory-efficient for large empty regions.

  • Granularity: Voxels provide a fixed resolution (size of each voxel). Features smaller than a voxel cannot be represented unless you subdivide further. Gaussians are continuous and can have arbitrarily small covariance to represent fine features.

  • Rendering speed: A dense voxel grid requires sampling many steps even in empty space (unless skipping empty cells using an octree or occupancy grid). Gaussians inherently skip empty space – no Gaussian, nothing to render.

| Aspect | Voxel Grids (e.g., Plenoxels) | 3D Gaussian Splatting |
|---|---|---|
| Adaptivity | Limited by grid resolution | Highly adaptive (varying density) |
| Memory Efficiency | Low (stores empty space) | High (concentrates on surfaces) |
| Rendering Speed | Moderate to high | Very high |
| Detail Representation | Limited by grid resolution | Can represent fine details |

Traditional Point-Based Rendering vs. Gaussian Splatting

Traditional point-based rendering renders point clouds with splats or discs, often for models acquired from real-world scans. Methods from the early 2000s introduced ellipse splatting with EWA filtering to improve quality.

  • Rendering Algorithm: Traditional methods such as Surfels (Pfister 2000) or EWA Surface Splatting (Zwicker 2001) take a point cloud (with normals, colors, etc.) and, in a single pass, project each point as a disk onto the screen, blending with neighbors to fill holes. Gaussian Splatting treats all Gaussians as potentially translucent and renders them with alpha blending in depth order.

  • Differentiability and Optimization: Traditional point rendering took the points as given and did not move them. Gaussian Splatting integrates the rendering into an optimization loop to adjust points.

  • Splat shape: Classic splatting often used circles or ellipses oriented according to surface normal. Gaussian Splatting’s splats are not exactly oriented by a surface normal (they don’t even explicitly store a normal), but effectively the covariance can align with the local surface orientation.

| Aspect | Traditional Point-Based Rendering | 3D Gaussian Splatting |
|---|---|---|
| Point Representation | Disks or simple splats | Anisotropic 3D Gaussians |
| View-Dependent Effects | Limited | Supported via spherical harmonics |
| Quality | Moderate | High |
| Differentiability | Not always | Fully differentiable |
| Adaptive Density | Fixed | Dynamic (splitting/pruning) |

Applications and Extensions

3D Gaussian Splatting has been applied to various domains and extended in multiple directions.

Static Scene Reconstruction

The primary application of 3D Gaussian Splatting is static scene reconstruction for novel view synthesis. Given a set of images of a scene (with known camera poses via structure-from-motion), one can use Gaussian Splatting to reconstruct the scene’s appearance such that new viewpoints can be rendered at high quality.

This makes it a drop-in alternative for NeRF in tasks like:

- Virtual tourism
- Heritage preservation
- Architectural visualization
- Content creation for VR/AR

Because of its fast training, this method is appealing for scenarios where turnaround time is important (e.g., on-set reconstruction in filmmaking, or quickly scanning an environment with a phone and getting a 3D model within minutes). The fact it achieves real-time rendering means the reconstructed model can be used in interactive applications, VR/AR, or video games directly, without needing heavy neural network inference at runtime.

Dynamic Scene Capture

Extensions to handle dynamic scenes include:

  • Deformable Gaussians: Adding a deformation field to warp Gaussians over time

  • 4D Gaussians: Extending the representation to include time as a fourth dimension

  • Dynamic 3D Gaussians: Tracking the motion of Gaussians through a sequence

These approaches enable high-quality reconstruction and rendering of dynamic content like humans in motion. A fully 4D Gaussian representation would be where each Gaussian is actually a Gaussian in spacetime (with a mean position in 3D and time, and a covariance that could span time as well).

Avatar Creation and Animation

3D Gaussian Splatting can be combined with parametric human models like SMPL:

  1. Use SMPL as a template for human shape and pose

  2. Optimize Gaussians to match the appearance of a specific person

  3. Animate the Gaussians by transferring motion to the underlying skeleton

This creates animatable avatars with high visual fidelity for applications in virtual reality, gaming, and digital communication.

Integration with Neural Rendering

Gaussian Splatting can be integrated with neural networks in various ways:

  • Neural Refinement: After producing a Gaussian-splatting render, one could feed the rendered image (and perhaps depth) into a neural super-resolution or refinement network to enhance details or consistency.

  • Hybrid Models: One could use Gaussians as an intermediate representation inside a neural pipeline. For example, a neural network could predict the Gaussian parameters from input images (instead of optimizing them).

  • Neural Shading: The Gaussian splats currently use simple SH lighting. One could incorporate a small neural network that given viewing direction and perhaps other features outputs a shading multiplier for each Gaussian.

  • Feature Integration: Gaussians could carry not just color but feature vectors that are processed by a network to produce final pixel colors.

One interesting integration is using Gaussians as an initial guess for NeRF or vice versa. Since Gaussian Splatting is fast, one could get a decent model quickly, then perhaps distill it into a NeRF network for portability or editing.

Large-Scale Scene Rendering

Large-scale scenes (think a whole city block or a forest) pose challenges in representation and rendering. Gaussian Splatting can handle unbounded scenes, but scaling to truly large scenes with millions of distinct objects might require additional strategies:

  • Tiling the Scene: Just like tiling the image, one can spatially partition the scene into regions (octree or grid of cells). Each region contains a subset of Gaussians. One can frustum-cull entire regions quickly if they are out of view, and only stream the Gaussians for visible regions to the GPU.

  • LOD (Level of Detail): For far distance, you might not need all Gaussians. Perhaps combining some into a single Gaussian (merging) could reduce count.

  • Out-of-Core management: If a scene is so large that millions of Gaussians don’t fit in memory, one could load and unload parts based on camera. Since each Gaussian is independent, chunks can be loaded from disk (with some spatial index).

A noteworthy extension addressing larger scenes is GaussianPro (Cheng et al., 2024). GaussianPro specifically tackled the issue that SfM initialization fails on texture-less areas, leading to insufficient Gaussians there and poor reconstruction. They introduced a progressive propagation to add Gaussians in those areas using patch-match stereo cues. This improved quality on large indoor/outdoor scenes.

Bézier Gaussian Triangles (BG-Triangle) for Sharper Rendering

One limitation of plain 3DGS is its tendency to blur high-frequency details, especially along sharp object edges. Because Gaussian splats overlap and have smooth radial profiles, representing a hard edge or surface boundary would require many tiny Gaussians, and even then edges may appear slightly soft. The BG-Triangle method addresses this by introducing a hybrid primitive that combines Gaussian splats with parametric surface patches to preserve discontinuities. Specifically, Bézier Gaussian Triangles (BG-Triangles) represent local surfaces as curved triangular patches (a Bézier triangle is a curved surface defined by control points, generalizing Bézier curves to 2D domains) rather than a cloud of points. Each such patch carries its own smooth internal texture representation but maintains explicit sharp boundaries at its edges.

Representation

A BG-Triangle is defined by a set of control points forming a Bézier surface patch of a certain degree (e.g. quadratic). This patch can model a continuous surface area. Along with this, BG-Triangle stores an attribute map (akin to a texture) that encodes color and possibly finer details across the surface. The key idea is that the interior of the triangle can still be rendered via splatting sub-primitives, but the edges of the triangle serve as differentiable discontinuities, preventing cross-boundary blurring. During rendering, the curved triangle is tessellated into many small pieces (sub-triangles) if needed. Those pieces are then rasterized using a modified alpha blending that is discontinuity-aware – essentially ensuring that adjacent triangles do not bleed color across their borders. Each sub-triangle or sample on the patch can still be associated with Gaussian kernels for anti-aliasing (hence “Gaussian triangle”), but the structure imposes a sharper geometry.

Performance

By fitting vector-like surface primitives, BG-Triangle drastically reduces the number of primitives needed to represent a scene while maintaining or improving visual quality. Yao et al. report that BG-Triangle achieves comparable error metrics to 3DGS on standard benchmarks, yet uses far fewer primitives and yields much crisper edges in novel views. For example, in close-up views of the NeRF Synthetic dataset, BG-Triangle renders object boundaries sharply, whereas Gaussian splats produce slight blur or splatter. Table 2 in their paper quantitatively shows BG-Triangle outperforming pure Gaussian methods (3DGS and a mipmapped variant) under equal-capacity settings. The method’s boundary-preserving advantage is evident: as noted, BG-Triangle can even appear sharper than the ground truth at some zoom levels due to its anti-aliasing, while 3DGS appears blurrier. Furthermore, BG-Triangle bridges classical and neural representations – it hints at how vector graphics-style patches can be optimized in a differentiable way from images. This hybrid approach can potentially lead to representations that are easier to edit (since one can move control points of patches) while still being learned from data.

In summary, BG-Triangle extends 3D Gaussian Splatting by introducing structured surface elements. It preserves the benefits of 3DGS (differentiability, volumetric effects, rasterization speed) but mitigates its weaknesses at discontinuities. The cost is a slightly more complex pipeline (tessellation, handling of patch boundaries) and possibly the need for an initial mesh or surface guess to initialize control points. However, robust results have been demonstrated by automatically deriving coarse “boundary meshes” from a point cloud and then optimizing the Bézier patches. BG-Triangle’s success suggests that blending explicit geometric primitives (triangles) with Gaussian splats is a promising direction to achieve sharp, low-memory 3D representations.

Human Reconstruction with Gaussian Splatting and Priors

Modeling human subjects, especially from sparse camera views, is a challenging task where Gaussian splatting has recently made inroads. The human body presents complex, articulated geometry with clothing, and purely implicit or point-based methods struggle to generalize across different people and poses. Recent methods leverage human priors – such as parametric body models (SMPL/SMPL-X) or learned shape spaces – in combination with radiance field or splatting techniques to improve both quality and speed of novel-view rendering. Two notable threads are generalizable NeRF-based approaches like EG-HumanNeRF, and Gaussian splat-based approaches like GPS-Gaussian and Generalizable Human Gaussians (GHG). We discuss how each integrates human priors and the role of Gaussian splats.

EG-HumanNeRF: Efficient Generalizable Human NeRF

EG-HumanNeRF (Wang et al., 2024) is a NeRF-style pipeline that achieves real-time rendering of human scenes from sparse views by heavily incorporating a parametric body prior. While not a Gaussian splatting method per se, it directly addresses the same problem setting as many GS approaches (fast novel view synthesis of humans) and offers a useful comparison between explicit splatting and efficient neural rendering. EG-HumanNeRF’s key ideas include: (1) Using a fitted SMPL-X body mesh as a coarse geometry prior to guide sampling; (2) Reducing ray samples via a two-stage sampling strategy that first intersects rays with an inflated “boundary mesh” around the body (to skip empty space) and then only samples a small number of points near the surface for volume rendering; (3) An occlusion-aware attention mechanism that uses the prior mesh to infer which regions are occluded from each view, helping the network avoid artifacts in unseen areas; and (4) An image-space refinement network to further clean up the rendered output. They also introduce a signed ray distance function (SRDF) loss at each sample to regularize the volume density learning.

The result is a system that outperforms prior human view synthesis methods in quality while running at comparable speeds to the fastest (splatting-based) methods. For instance, EG-HumanNeRF is shown to eliminate artifacts that other methods produce in occluded areas: when input views are sparse, baseline methods had blurriness or missing limbs (e.g. fingers) due to uncertain geometry, whereas EG-HumanNeRF’s use of the SMPL prior plus occlusion reasoning yields intact, sharp renderings. A comparative table from their paper (reproduced in Figure 1 of the source) contrasts EG-HumanNeRF with others: methods like KeypointNeRF and others used some human prior but were slow, while methods like GPS-Gaussian (discussed next) were fast but lacked any human prior and thus could suffer in quality. EG-HumanNeRF hits a sweet spot by using the mesh prior for both speed (drastically fewer ray samples) and quality (guiding the NeRF to plausible human shapes even in unseen regions). In terms of numbers, it can render at real-time rates (on the order of 20–30 FPS) and produces high-fidelity images that surpass prior generalizable NeRFs by >1 dB PSNR in their tests, without test-time fine-tuning.

The success of EG-HumanNeRF underscores the value of integrating parametric models (SMPL-X) with volumetric rendering. By itself, NeRF would require many samples and might falter with few views, but the human prior constrains the solution. This idea carries over to Gaussian splatting approaches as well.

GPS-Gaussian: Pixel-Wise Gaussian Splatting for Humans

GPS-Gaussian (Zheng et al., CVPR 2024) is a generalizable human view synthesis method that uses 3D Gaussian splats as the underlying representation, but predicts them in a single forward pass of a network (no per-subject optimization). The motivation is to achieve instantaneous rendering of a new person given only a sparse set of images, by training on a large dataset of human scans. GPS-Gaussian breaks from the typical approach of optimizing a point cloud for each subject; instead it learns a direct mapping from input views to Gaussian parameters.

Key innovations: GPS-Gaussian introduces the concept of pixel-aligned Gaussian parameter maps. Given two (or a few) input images of a person from different views, the method produces for each source view an image-sized map where each pixel stores the parameters of a Gaussian (position, orientation, scale, color, opacity) that corresponds to that pixel’s projection of the person. In essence, it reprojects the problem to 2D: each foreground pixel “emits” a Gaussian into 3D. To lift these into 3D space, a differentiable stereo depth module estimates per-pixel depth using the two source views. Once depth is estimated, the 2D Gaussian parameters from both views are unprojected to 3D, yielding a set of 3D Gaussians covering the human’s surface. These can then be splatted to render a novel view with standard 3DGS rendering. Because some pixels might be erroneous in depth (especially under self-occlusion), they train the depth estimator and the Gaussian prediction jointly, with an iterative refinement and losses that penalize inconsistent depths and improve rendering quality.
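The unprojection step can be illustrated with standard pinhole geometry. The sketch below is not the paper's code; the function and parameter names (f, cx, cy, R_c2w, t_c2w) are assumptions, and in a GPS-Gaussian-style pipeline the other predicted parameter maps (rotation, scale, color, opacity) would be gathered for the same valid pixels.

```python
import numpy as np

def unproject_depth_to_means(depth, f, cx, cy, R_c2w, t_c2w):
    """Lift a per-pixel depth map to 3D Gaussian centers (one per foreground pixel).

    depth: (H, W) array, 0 where there is no foreground.
    R_c2w, t_c2w: camera-to-world rotation and translation.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)   # pixel row (v) and column (u) grids
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) / f * z                    # invert the pinhole projection
    y = (v[valid] - cy) / f * z
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ R_c2w.T + t_c2w               # camera space -> world space

depth = np.zeros((4, 4)); depth[1:3, 1:3] = 2.0
means = unproject_depth_to_means(depth, f=500.0, cx=2.0, cy=2.0,
                                 R_c2w=np.eye(3), t_c2w=np.zeros(3))
print(means.shape)   # (4, 3): one Gaussian center per valid pixel
```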

Results: The outcome is impressive – GPS-Gaussian achieves 2K resolution novel view synthesis at >25 FPS on a single GPU, with a single network handling arbitrary new subjects. It requires no test-time optimization/fine-tuning; a new human can be rendered nearly instantly after a quick inference of the two-view networks. The use of convolutional image encoders (rather than 3D network queries) makes it very fast. Empirically, it outperforms prior state-of-the-art human view synthesis methods on quality while being real-time. Notably, Zheng et al. report that even compared to explicit point-based approaches, their learned method produces better geometry and appearance for unseen poses. The reliance on large-scale training (they train on “massive 3D human scans with diverse topologies, clothing styles and poses”) means the model has learned a strong human shape prior implicitly. In practice, this is similar to how one might use SMPL: the network has an idea of human morphology, which helps guide the depth predictions and color completion for occluded parts.

The difference from EG-HumanNeRF is that GPS-Gaussian is fully feed-forward. It does not even explicitly use a body model like SMPL; however, it effectively learns one internally via the training data. An advantage of GPS-Gaussian’s explicit splat output is that once the Gaussians are predicted, rendering is extremely fast and also flexible (one could in principle adjust them, apply transformations, etc.). The method demonstrates how Gaussian splatting can be plugged into a deep network pipeline for generalization, combining the best of both worlds: neural priors from data and the efficiency of explicit splats.

Generalizable Human Gaussians (GHG) with SMPL

An alternative approach to human generalization is to explicitly tie the Gaussian representation to a parametric human model. Generalizable Human Gaussians (GHG) by Kwon et al. (arXiv 2024) does this by learning Gaussian parameters on the 2D UV space of a template mesh (SMPL-X). In GHG, each Gaussian is essentially anchored to a location on the SMPL-X surface (via its UV coordinate). This means when a new subject with a fitted SMPL-X mesh is given, one can map the learned Gaussian field onto that mesh, adjusting for the person’s shape and pose. The Gaussian colors and offsets are predicted by a network, similar in spirit to GPS-Gaussian but using the UV parametric space as the domain for convolution (rather than image pixels). A multi-scaffold strategy is introduced to allow additional flexibility – e.g. one scaffold might model coarse body shape while another adds high-frequency clothing wrinkles as offsets to Gaussian positions.

By leveraging the strong geometry prior of SMPL-X, GHG ensures that the learned Gaussians always lie on or near a plausible human surface. This greatly reduces ambiguity from sparse views. Their results show excellent generalization: from as few as 3 input images of a new person, GHG can render photorealistic views without finetuning, outperforming prior generalizable methods on both seen and unseen datasets. Because the representation is explicit (a set of Gaussians on a mesh), it retains real-time rendering capability via splatting. In fact, GHG uses the same volume splatting renderer as 3DGS (they call it a “fast rendering paradigm of volume splatting”). The difference is that the Gaussians are now organized according to a mesh template. This also allows the method to potentially do animation: since the Gaussians move with the SMPL-X pose, one could drive the captured appearance with new poses (though the paper’s focus is novel-view rather than novel-pose).

Integrating SMPL in this way has some trade-offs: it assumes a decent fit of the parametric model to the subject (which is feasible with few images using existing human pose estimators), and it inherits the template’s limitations (e.g. difficulty with very loose clothing or long hair not represented in SMPL-X). However, the clear benefit is consistency and completeness – even if an arm is completely occluded in the input views, the SMPL prior ensures Gaussians are still placed for that arm, and their colors can be inferred from neighboring pixels via the learned UV-space network. This addresses a common failure of unguided methods, which might leave holes or require inpainting for unseen regions.

In summary, these human-centric developments show that Gaussian Splatting can be enhanced by human structural priors in multiple ways. EG-HumanNeRF uses the prior to guide NeRF sampling; GPS-Gaussian trains a feed-forward network (implicitly learning a prior) to predict splats; and GHG explicitly uses the SMPL-X surface as a scaffold for Gaussians. All achieve high-quality, fast human renderings from sparse inputs – a task that is significantly harder than the static scene case. A common theme is that the combination of explicit (Gaussian or mesh) representations with learned priors is very powerful for generalization.

Dynamic Scene Reconstruction with 4D Gaussian Splatting

Beyond static scenes and single moments, researchers have extended Gaussian splatting to model dynamic, time-varying scenes (a full 4D space-time representation). Traditional dynamic view synthesis with NeRFs often used deformation fields or per-frame NeRFs, but these are computationally heavy and hard to run in real time. Point-based approaches promise faster rendering, and indeed 4D Gaussian Splatting (4DGS) has emerged as a leading explicit method for dynamic scenes. In 4DGS, each primitive is a spatio-temporal Gaussian – effectively an ellipsoid moving or changing over time.

4D Gaussian Splatting (4DGS)

Wu et al. (CVPR 2024) introduced one formulation of 4DGS that achieves real-time rendering of dynamic scenes, even for challenging cases like non-rigid motions. Their approach represents the scene with one set of canonical 3D Gaussians (at a reference time) and learns a Gaussian deformation field that maps those Gaussians to their positions at each time frame. In other words, rather than maintaining completely independent Gaussians for each time step, the Gaussians keep a persistent identity and move over time. A tiny multi-head MLP (the deformation decoder) takes as input a Gaussian’s canonical coordinates and a time \(t\), and outputs that Gaussian’s offset, rotation, and scale change for time \(t\). A spatial-temporal encoder processes groups of Gaussians to embed motion cues, allowing the network to infer coherent motion even if a Gaussian was occluded for a while. This approach is akin to learning a continuous motion field for the points. Because only one set of Gaussians is stored (the canonical set) plus a small network, the memory footprint remains manageable.
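
As a rough illustration of this design, here is a minimal PyTorch sketch of a deformation decoder that maps canonical Gaussian centers and a time value to per-Gaussian position offsets, rotation deltas, and scale deltas. The layer sizes and the omission of the spatio-temporal feature encoder are simplifying assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """Tiny multi-head MLP in the spirit of deformation-field 4DGS: given a
    canonical Gaussian position and a time t, predict the position offset,
    rotation change and scale change at that time. (A simplified sketch; the
    published model also uses a learned spatio-temporal encoder, omitted here.)"""
    def __init__(self, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.head_dxyz = nn.Linear(hidden, 3)    # position offset
        self.head_drot = nn.Linear(hidden, 4)    # quaternion delta
        self.head_dscale = nn.Linear(hidden, 3)  # log-scale delta

    def forward(self, xyz_canonical, t):
        # xyz_canonical: (N, 3) canonical centers, t: scalar time in [0, 1]
        t_col = torch.full_like(xyz_canonical[:, :1], float(t))
        h = self.trunk(torch.cat([xyz_canonical, t_col], dim=-1))
        return self.head_dxyz(h), self.head_drot(h), self.head_dscale(h)

decoder = DeformationDecoder()
canon_xyz = torch.randn(1000, 3)             # canonical Gaussian centers
d_xyz, d_rot, d_scale = decoder(canon_xyz, t=0.3)
xyz_at_t = canon_xyz + d_xyz                 # deformed centers at time t
```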

Advantages: The deformation-field 4DGS has a big advantage in storage and coherence. Naively, one could optimize a separate 3DGS model for every time frame, but for a long sequence that would multiply storage and also not exploit temporal redundancy. Wu et al. show that with their method, they can render dynamic scenes at up to 82 FPS (800×800) or ~30 FPS at HD resolution, with quality on par with or better than dynamic NeRFs. Importantly, their representation is compact and editable: since each Gaussian persists through time, one can track correspondences or even modify motions. They demonstrate basic 4D editing and object tracking as a benefit. This explicit nature is a contrast to neural dynamic radiance fields that often entangle time and view in complex ways.

However, 4DGS in its initial form did face challenges. A major one is temporal redundancy: many Gaussians “live” for only a short period, while others are effectively static yet end up duplicated across time due to optimization issues. Also, even with deformation fields, the number of Gaussians needed to capture fine details can still be high (millions), leading to heavy memory use (the original 4DGS could require ~2 GB for a dynamic scene). These issues spurred follow-up research.

Speed and Memory Enhancements (4DGS-1K and MEGA)

In 2025, Yuan et al. proposed 4DGS-1K, so named for achieving over 1000 FPS rendering of dynamic scenes. They systematically address the redundancies: (1) Short-lifespan Gaussians – they introduce a spatio-temporal variation score to prune Gaussians that only contribute briefly, encouraging the model to explain motion with longer-lived Gaussians. (2) Inactive Gaussians per frame – they maintain a per-frame mask of which Gaussians are actually visible/needed, so the rasterizer can skip large portions that do not affect the current view. With these techniques, 4DGS-1K reduces the Gaussian count dramatically and avoids rasterizing unnecessary splats each frame. As a result, they report 1000+ FPS rendering on modern GPUs, a 50× speedup over the original, while keeping image quality largely intact. Storage was also reduced to a fraction of the original (they note using only ~10% of the storage with comparable quality). This pushes Gaussian splatting firmly into the realm of real-time 4D video. Their evaluations on datasets like N3V (the Neural 3D Video dataset with six multi-view videos) show that 4DGS-1K achieves PSNR/SSIM similar to vanilla 4DGS but with much smaller memory and vastly higher FPS. Such performance makes even interactive applications (e.g. dynamic-scene VR) conceivable with explicit splats.
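
The two ideas can be illustrated with a toy NumPy sketch: prune Gaussians whose contribution is significant in only a small fraction of frames, then keep a per-frame boolean mask so inactive Gaussians can be skipped at render time. The scoring here is a deliberately simple placeholder, not the paper’s actual spatio-temporal variation score.

```python
import numpy as np

def prune_and_mask(contrib, eps=1e-3, min_life_frac=0.1):
    """Toy version of the two ideas in 4DGS-1K (not the authors' exact criteria):
      1) prune Gaussians whose contribution is significant in only a small
         fraction of frames (short-lifespan primitives),
      2) build a per-frame boolean mask so the rasterizer can skip Gaussians
         that are inactive in the current frame.
    contrib : (N, T) array, e.g. per-frame accumulated alpha of each Gaussian.
    Returns indices of kept Gaussians and a (T, N_kept) active mask."""
    N, T = contrib.shape
    active = contrib > eps                       # (N, T) activity per frame
    lifespan = active.sum(axis=1) / T            # fraction of frames alive
    keep = np.nonzero(lifespan >= min_life_frac)[0]
    per_frame_mask = active[keep].T              # (T, N_kept)
    return keep, per_frame_mask

# Synthetic contributions: each Gaussian is active over a random time window.
rng = np.random.default_rng(0)
N, T = 10000, 120
start = rng.integers(0, T, size=(N, 1))
length = rng.integers(1, T, size=(N, 1))
frames = np.arange(T)[None, :]
contrib = ((frames >= start) & (frames < start + length)) * rng.random((N, 1))
keep, masks = prune_and_mask(contrib)
print(f"kept {len(keep)} / {N} Gaussians; frame 0 draws {masks[0].sum()} of them")
```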

Another effort, MEGA (Zhang et al., 2024) focused on memory efficiency of 4DGS. They observed that a big memory hog was the color storage per Gaussian: original 4DGS stored up to 144 parameters of SH for color. MEGA replaces this with a hybrid approach: each Gaussian has just a 3-parameter base color plus a learned global color predictor that adds view-dependent effects. This cuts down memory enormously (no need for high-order SH per point). They also impose an entropy-based regularization on the deformation field so that each Gaussian covers a larger temporal span (similar spirit to 4DGS-1K’s idea) and they penalize opacity complexity to force fewer Gaussians. Combined with compression (half precision, etc.), they achieved up to 125×–190× reduction in storage on dynamic datasets, again without quality loss. These improvements mean a dynamic scene that might have taken gigabytes can be stored in just tens of MB, which is even smaller than some implicit methods, all while preserving the real-time nature (rasterization remains fast).
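
A minimal PyTorch sketch of this hybrid color idea follows: each Gaussian stores only a 3-value base color plus a small feature, and a single shared network adds a view-dependent residual. The layer sizes, feature dimension, and blending are illustrative assumptions rather than MEGA’s actual design.

```python
import torch
import torch.nn as nn

class HybridColor(nn.Module):
    """Sketch of a hybrid color model: each Gaussian stores a 3-parameter base
    RGB, and one shared network predicts a view-dependent residual from the
    viewing direction plus a small per-Gaussian feature (illustrative sizes)."""
    def __init__(self, n_gaussians, feat_dim=8, hidden=32):
        super().__init__()
        self.base_rgb = nn.Parameter(torch.rand(n_gaussians, 3))
        self.feat = nn.Parameter(torch.zeros(n_gaussians, feat_dim))
        self.view_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, view_dirs):
        # view_dirs: (N, 3) unit vectors from each Gaussian toward the camera
        residual = self.view_mlp(torch.cat([self.feat, view_dirs], dim=-1))
        return torch.sigmoid(self.base_rgb) + 0.1 * torch.tanh(residual)

model = HybridColor(n_gaussians=5000)
dirs = torch.nn.functional.normalize(torch.randn(5000, 3), dim=-1)
rgb = model(dirs)    # (5000, 3) colors for the current view
```

Compared with storing dozens of spherical-harmonic coefficients per Gaussian, this layout keeps only three color values and a small shared network, which is the essence of the memory saving described above.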

Applications to MoCap and 4D Human Rendering

Dynamic Gaussian splatting has also been applied specifically to human motion capture and 4D human reconstruction. For instance, some works (e.g. Li et al. 2024, as cited in GPS-Gaussian) learn animatable Gaussian avatars – essentially learning pose-dependent Gaussian distributions for a human, so that given new skeletal poses, the Gaussians move accordingly to render novel poses. These approaches combine the benefits of skeletal animation (via something like SMPL) with Gaussian splat rendering. The Animatable Gaussians work by Li et al. (CVPR 2024) learned pose-dependent 2D Gaussian maps for a human head and body, enabling high-fidelity avatars that can be driven by motion capture data. It indicates that Gaussian splats can serve as an intermediate representation for deformable objects where control signals (like joint angles) modulate the distribution of Gaussians.

Moreover, some early attempts integrate 4DGS with SLAM (simultaneous localization and mapping) for dynamic scenes, demonstrating that one can potentially do online reconstruction of a changing environment by continually updating Gaussian splats in space-time. This could be useful for capturing urban environments with moving vehicles (one reference mentions a “Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction” – likely a specialized dynamic GS for city scenes). Though not all these are published in major venues, they show the extensibility of Gaussian splatting to various domains.

In the context of motion capture (MoCap), 4DGS can be seen as an alternative to parametric mesh tracking. Instead of capturing a performer with multi-view cameras and fitting a mesh sequence, one could fit a collection of Gaussians to the multi-view video. The advantage is that the result is fully textured and can be rendered from any viewpoint with correct appearance (the splats directly store color and learned lighting effects). The disadvantage is that it’s less straightforward to retarget or manipulate than a mesh; however, if combined with a skeleton prior (like attaching Gaussians to SMPL, as GHG or animatable Gaussians do), one gains that control back. We are witnessing a convergence of ideas: some methods are bringing neural volumetric ideas into the mesh world (NeRF + SMPL), and others are bringing graphics primitives into the neural world (Gaussians + networks). Dynamic Gaussian splatting sits in the middle, offering a fully explicit yet dynamic scene representation.

Dynamic Benchmarks: Common datasets to test these methods include N3V (a public multi-view video dataset), the D-NeRF synthetic dynamic dataset (with simple animated objects), and for human-specific methods, sequences from ZJU-MoCap or H36M, as well as CAPE (for clothed humans) or even synthetic data derived from AMASS motions. For example, a dynamic human method might take a sequence of a person from multiple cameras (like the H36M dataset of people moving) and reconstruct a 4D Gaussian scene. The dynamic nature means metrics are computed over space-time reconstructions, looking at consistency and temporal coherence in addition to per-frame quality.

Implementation Details and Real-Time Performance

A significant appeal of Gaussian splatting is its practicality: with the right implementation, it can run at high frame rates and scale to large scenes. Here we outline some low-level considerations:

Data Structures

Efficient spatial indexing of millions of Gaussians is important. In static 3DGS, one can use a tree or grid to cull Gaussians outside the camera frustum or below a certain size at a given distance. The original 3DGS used a visibility buffer to skip splatting Gaussians that don’t contribute (similar to a z-buffer test). For level-of-detail (LoD) rendering, each Gaussian can have a multiscale representation (mip-mapped radius and pre-filtered color), so that when far from the camera, it can be drawn smaller or not at all. Dynamic 4DGS adds a temporal dimension, where one can precompute per-frame active sets of Gaussians, as 4DGS-1K did with bitmasking.
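
As an illustration of such culling, the NumPy sketch below keeps only Gaussians that lie in front of the camera, inside a slightly padded field of view, and whose approximate screen-space radius exceeds a pixel threshold. The camera convention (looking down -z), thresholds, and radius estimate are assumptions for illustration, not the official 3DGS culling code.

```python
import numpy as np

def cull_gaussians(centers_cam, radii, fov_x=np.radians(70), fov_y=np.radians(45),
                   image_width=1920, min_pixel_radius=0.5):
    """Simple frustum and screen-size culling sketch (not the official 3DGS code).
    centers_cam : (N, 3) Gaussian centers in camera space, camera looks down -z
    radii       : (N,)   rough world-space radius of each Gaussian
    Keeps Gaussians in front of the camera, inside a padded field of view, and
    whose approximate projected radius exceeds a pixel threshold."""
    x, y, z = centers_cam[:, 0], centers_cam[:, 1], centers_cam[:, 2]
    depth = -z                                      # positive in front of camera
    in_front = depth > 1e-3
    pad = 1.1                                       # small margin to avoid popping
    in_fov = (np.abs(x) <= pad * depth * np.tan(fov_x / 2)) & \
             (np.abs(y) <= pad * depth * np.tan(fov_y / 2))
    focal_px = image_width / (2.0 * np.tan(fov_x / 2))
    pixel_radius = radii * focal_px / np.maximum(depth, 1e-3)
    return in_front & in_fov & (pixel_radius > min_pixel_radius)

centers = np.random.randn(100000, 3) * 10.0         # toy Gaussians around the camera
radii = np.random.rand(100000) * 0.05
visible = cull_gaussians(centers, radii)
print(f"{visible.sum()} of {len(visible)} Gaussians survive culling")
```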

Rasterization & Shaders

Gaussian splats can be rendered using point primitives (with a geometry shader expanding each point into a billboard quad) or as triangle meshes (e.g. an ellipse approximated by a small polygon). A tile-based rasterization approach can also be used: splitting the screen into tiles and only processing Gaussians that intersect each tile. This is hinted at in some implementations (e.g. the MEGA paper mentions a GPU-friendly tile rasterizer). One must also sort splats by depth for correct blending; this can be costly if done on the CPU every frame. Instead, a common trick is to exploit hardware blending by drawing splats back-to-front in an approximate, grid-based depth order. Some research prototypes divide space into depth layers or sort per tile to manage this at scale.
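
The sketch below illustrates tile binning plus per-tile depth sorting in plain Python. It is a toy CPU version for clarity; real implementations do this with GPU radix sorts and CUDA kernels.

```python
import numpy as np

def bin_and_sort(screen_xy, depth, radius_px, width, height, tile=16):
    """Toy tile-based binning: assign each projected splat to every 16x16
    screen tile its radius touches, then sort the splats in each tile
    front-to-back by depth for compositing."""
    n_tx = (width + tile - 1) // tile
    n_ty = (height + tile - 1) // tile
    tiles = [[] for _ in range(n_tx * n_ty)]
    for i, ((x, y), r) in enumerate(zip(screen_xy, radius_px)):
        tx0, tx1 = int((x - r) // tile), int((x + r) // tile)
        ty0, ty1 = int((y - r) // tile), int((y + r) // tile)
        for ty in range(max(ty0, 0), min(ty1, n_ty - 1) + 1):
            for tx in range(max(tx0, 0), min(tx1, n_tx - 1) + 1):
                tiles[ty * n_tx + tx].append(i)
    # front-to-back order within each tile
    return [sorted(idx, key=lambda i: depth[i]) for idx in tiles]

xy = np.random.rand(2000, 2) * [640, 480]   # projected splat centers in pixels
z = np.random.rand(2000) * 10 + 1           # depths
r = np.random.rand(2000) * 8 + 1            # screen-space radii in pixels
tile_lists = bin_and_sort(xy, z, r, 640, 480)
print("splats touching tile 0:", len(tile_lists[0]))
```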

GPU Memory and Throughput

Each Gaussian carries a non-trivial amount of data (position, covariance, color etc.). Packing these efficiently (e.g. quantizing covariance or using half floats) can save memory bandwidth. The memory footprint was a concern especially for 4DGS, leading to solutions like MEGA’s SH compression. Also, while millions of Gaussians might be stored, at render time only a fraction are actually drawn (those in view and above a size threshold). Efficient culling on the GPU (via compute shaders that mark active splats) can dramatically reduce the fragment shading cost.
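
For a rough sense of the numbers, the back-of-the-envelope sketch below compares a float32 and a float16 layout for one plausible per-Gaussian attribute set. The exact layout is an assumption; real implementations differ and often quantize more aggressively.

```python
# Rough memory estimate for one Gaussian and the effect of half-precision
# packing (illustrative attribute layout, not any implementation's file format).
full = dict(position=3 * 4, rotation=4 * 4, scale=3 * 4, opacity=1 * 4,
            sh_color=48 * 4)                        # bytes at float32, SH degree 3
half = {k: v // 2 for k, v in full.items()}         # same layout at float16
print("float32 per Gaussian:", sum(full.values()), "bytes")
print("float16 per Gaussian:", sum(half.values()), "bytes")
print("1M Gaussians: %.1f MiB -> %.1f MiB"
      % (sum(full.values()) * 1e6 / 2**20, sum(half.values()) * 1e6 / 2**20))
```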

Differentiable Rendering Implementation

For training, one often uses a custom CUDA kernel or OpenGL fragment shader that computes the Gaussian coverage of each pixel and its derivatives w.r.t. the Gaussian parameters. This involves calculating the projected ellipse and its intersection with the pixel. Some works rely on prior art like SoftRasterizer or Differentiable Surface Splatting (DSS), but 3DGS implementations typically provide their own kernels for anisotropic Gaussians with alpha blending. Because the renderer is differentiable, one can directly optimize millions of Gaussian parameters with gradient descent (often using the Adam optimizer). It’s noteworthy that despite the large number of parameters, the explicit nature and good initialization (from SfM) allow fairly quick convergence to high quality.
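
To make the idea tangible, here is a deliberately naive differentiable splatting sketch in PyTorch: it evaluates every projected 2D Gaussian at every pixel and composites them with the “over” operator, letting autograd supply the gradients that a production CUDA kernel computes analytically. The dense per-pixel loop, the sorting assumption, and the shapes are simplifications, not the official 3DGS rasterizer.

```python
import torch

def splat_image(means2d, inv_cov2d, colors, opacities, H=64, W=64):
    """Minimal differentiable splatting: evaluate each projected 2D Gaussian at
    every pixel and alpha-composite back-to-front with the 'over' operator.
    Assumes Gaussians are pre-sorted far-to-near. Gradients w.r.t. all Gaussian
    parameters flow through autograd.
    means2d: (N,2), inv_cov2d: (N,2,2), colors: (N,3), opacities: (N,)"""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)        # (P, 2) pixel centers
    d = pix[None, :, :] - means2d[:, None, :]                 # (N, P, 2)
    md = torch.einsum("npi,nij,npj->np", d, inv_cov2d, d)     # Mahalanobis distance
    alpha = opacities[:, None] * torch.exp(-0.5 * md)         # (N, P) per-pixel alpha
    img = torch.zeros(pix.shape[0], 3)
    for a, c in zip(alpha, colors):                           # back-to-front "over"
        img = (1 - a)[:, None] * img + a[:, None] * c
    return img.reshape(H, W, 3)

means = (torch.rand(32, 2) * 64).requires_grad_(True)
inv_cov = torch.eye(2).repeat(32, 1, 1) * 0.05
rgb = torch.rand(32, 3, requires_grad=True)
op = torch.full((32,), 0.8, requires_grad=True)
loss = splat_image(means, inv_cov, rgb, op).mean()
loss.backward()          # means.grad, rgb.grad, op.grad are now populated
```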

Training vs. Inference Compute

Training a 3DGS from scratch on a scene can still be time-consuming (though faster than NeRF in many cases). Kerbl et al. reported competitive training times – on the order of tens of minutes to an hour per scene – to reach high quality, thanks to not shooting rays through empty space. Generalizable methods like GPS-Gaussian shift the heavy lifting to offline training on many scans (taking days on large GPUs), but then inference per scene takes seconds. In deployment, rendering is extremely fast: essentially just a draw call of all splats. For example, the official 3DGS codebase uses OpenGL to render splats in real time, and the bottleneck becomes how many splats the GPU can rasterize per frame (modern GPUs handle tens of millions of fragments easily at 60+ FPS).

Accuracy vs. Speed trade-offs

Some quality-improving steps can slow down rendering. For instance, alias-free splatting (mip-splatting) requires computing a footprint for each splat at each frame to avoid under-sampling. This is an added per-splat cost, but it prevents shimmering when zooming out. Another example: one could increase the SH basis order to capture more view-dependent effects, but that means more parameters and more compute per pixel to evaluate the SH. Implementations often choose a moderate SH order (like 2nd order, 9 coeffs per color) as a balance. Some of the “compact” variants (Mini-Splatting, Scaffold-GS) introduce small neural networks to predict fewer splats – these hybrid approaches sacrifice a bit of pure speed (due to network inference) in exchange for using fewer primitives overall.
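
For reference, evaluating a degree-2 SH color for one channel is just a 9-term dot product per Gaussian per frame, as in the small sketch below. The constants are the standard real-SH normalization factors; the sign convention follows common splatting code and assumes a unit view direction.

```python
import numpy as np

def eval_sh_deg2(coeffs, d):
    """Evaluate real spherical harmonics up to degree 2 (9 basis functions)
    for one color channel along a unit view direction d = (x, y, z)."""
    x, y, z = d
    basis = np.array([
        0.2820947918,                       # l=0
        -0.4886025119 * y,                  # l=1, m=-1
        0.4886025119 * z,                   # l=1, m=0
        -0.4886025119 * x,                  # l=1, m=1
        1.0925484306 * x * y,               # l=2, m=-2
        -1.0925484306 * y * z,              # l=2, m=-1
        0.3153915653 * (3 * z * z - 1),     # l=2, m=0
        -1.0925484306 * x * z,              # l=2, m=1
        0.5462742153 * (x * x - y * y),     # l=2, m=2
    ])
    return float(coeffs @ basis)

coeffs = np.random.randn(9) * 0.1           # 9 coefficients for one channel
d = np.array([0.0, 0.0, 1.0])               # viewing direction (unit length)
print(eval_sh_deg2(coeffs, d))
```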

In practice, many of these methods are available as open-source, with highly optimized CUDA code. For example, the original 3DGS authors released a CUDA/OpenGL code that can train on a scene and then export the splat model to be rendered in standard graphics engines. Chaos V-Ray even integrated Gaussian splats into a ray tracer, allowing mixing splats with traditional rendering – this underscores that the format is becoming mature enough for content creation pipelines.

Benchmarks and Comparative Evaluation

To measure progress, researchers evaluate on standard datasets: NeRF Synthetic (Blender) for controlled scenes with ground-truth geometry, LLFF (Real Forward-Facing) for real-world still captures, Tanks and Temples for large-scale outdoor scans, and others for static scenes. For dynamic scenes, benchmarks include D-NeRF (synthetic moving objects), the HyperNeRF dataset, and N3V (Neural 3D Video), which provides real multi-view videos. For human-focused methods, common datasets are THuman 2.0 (a set of 3D scanned people with SMPL fits), RenderPeople (a collection of commercial scanned avatars), and sequences like H36M or ZJU-MoCap for moving humans. Additionally, datasets like CAPE (Clothed Auto Person Encoding) provide sequences of 3D meshes of clothed people, which can be used to evaluate how well methods capture deforming clothing over time.

Static Scene Comparison

On Blender scenes (e.g. Lego, Chair, Drums), 3DGS typically achieves similar PSNR/SSIM to NeRF while rendering 1–2 orders of magnitude faster. BG-Triangle further excels in perceptual metrics (LPIPS) by preserving edges better. For real scenes like LLFF, 3DGS and NeRF are often on par, though 3DGS can struggle if the point cloud initialization is sparse (a NeRF may fill in regions even with few points). Tanks & Temples results in Kerbl et al.’s paper show 3DGS surpassing mip-NeRF360 in some cases – indicating explicit points handle large unbounded scenes well by not blurring distant details (mip-NeRF360 addresses that with scale-aware anti-aliasing, while 3DGS inherently has LoD from splat sizes). One weakness of 3DGS noted is slightly lower accuracy in very fine geometry (e.g. thin wires) because Gaussians might over-smooth them – techniques like mip-splatting or BG-Triangle mitigate this.

Human Novel-View Comparison

EG-HumanNeRF vs. GPS-Gaussian vs. GHG provides an interesting study. EG-HumanNeRF was shown to outperform KeypointNeRF, GP-NeRF, etc., in PSNR/SSIM, and even outdo GPS-Gaussian in problematic occluded cases (because of its occlusion handling). However, GPS-Gaussian, a CVPR 2024 highlight, also demonstrated superior quality to many contemporaries – EG-HumanNeRF’s advantage likely lies in very sparse inputs, where a learned prior alone might not guess hidden surfaces as well as a mesh-guided approach. GHG, in their cross-dataset tests (train on THuman, test on RenderPeople), reported higher PSNR/SSIM than both neural (KeypointNeRF, etc.) and explicit prior methods, confirming the benefit of UV-based Gaussian learning. In terms of speed, GPS-Gaussian and EG-HumanNeRF both target real-time: GPS-Gaussian inherently renders at 25+ FPS at high resolution, and EG-HumanNeRF’s design keeps it near that range (they cite competitive speed with “speed-prioritized” methods like GPS-Gaussian). GHG’s runtime is basically that of 3D splatting (very fast), plus whatever time it takes to predict Gaussians via their network (which could be a few hundred milliseconds given it’s a ConvNet on UV maps). All these methods can work with as few as 2–3 views of a person; their relative performance can depend on how complex the outfit is (a learned method might handle clothing seen in its training data better, whereas a mesh-guided method might handle arbitrary occlusion better).

Dynamic Scene Comparison

Dynamic NeRF methods (e.g. D-NeRF, Nerfies, HyperNeRF, NR-NeRF) often sacrifice speed – some can take minutes per frame to render, which is not real-time. 4DGS and its variants are orders of magnitude faster in rendering (real-time vs. offline), making direct comparison tricky. Quality-wise, early 4DGS achieved comparable PSNR to NeRF-based counterparts on simple benchmarks, and with 4DGS-1K and MEGA, the gap closed further. One interesting comparison is with deformation field methods like NR-NeRF or NSFF: those use a parametric motion field (usually learned via neural networks) to warp a static scene. 4DGS (Wu et al.) similarly uses a learned deformation field but explicitly applied to points. It was noted that NeRF-based deformation methods had much smaller storage (tens of MB) than uncompressed 4DGS (GBs), but with the new compression techniques, 4DGS now can also be in the tens of MB. In terms of visual fidelity, Gaussian methods naturally handle moderate topology changes (e.g. objects splitting) since they are just a cloud of blobs, whereas some NeRF methods that assume a single warp field can struggle with multiple independently moving parts. On the other hand, if a scene has very subtle motions (like a fluttering flag), NeRF might capture it with a neural field smoothly, whereas points might introduce a bit of noise to track it. The 1000+ FPS 4DGS (Yuan et al.) did note a slight reduction in PSNR (<0.2 dB) after aggressive pruning, which is a minor trade-off for the huge speed gain.

Comparative Summary Table

Table 3: Qualitative comparison of various methods

| Method | Representation | Use of Priors | Speed | Quality (relative) | Notes |
| --- | --- | --- | --- | --- | --- |
| NeRF (Mildenhall’20) | Implicit (MLP), volumetric | None (generic) | Slow (minutes/frame) | High (photorealistic, but needs many views) | Heavy per-scene training |
| Instant-NGP (2022) | Implicit (hashed grid) | None | Fast training/render (fps-level) | High (slightly lower on some fine details vs NeRF) | Large memory use for grid |
| 3DGS (Kerbl’23) | Explicit Gaussians | SfM init (structure) | Real-time (100+ FPS) | High (≈NeRF quality) | Struggles at very sharp edges |
| Mip-Splatting (2024) | Explicit Gaussians | None | Real-time (slightly slower due to filtering) | High (sharper, alias-free) | Solves aliasing with 2D prefilter |
| BG-Triangle (2025) | Hybrid Gaussian + surfaces | None (requires point cloud) | Interactive (tessellation overhead) | High (sharp boundaries best) | Few primitives (compact model) |
| KeypointNeRF (2023) | Implicit (NeRF) | 2D keypoints (pose prior) | Slow (offline) | High (good human detail) | Fails with occlusions, slow inference |
| EG-HumanNeRF (2024) | Implicit + mesh guide | SMPL-X prior (geometry) | Real-time | Very high (state of the art, human) | Mesh required; volume rendering |
| GPS-Gaussian (2024) | Explicit Gaussians (learned) | Learned human prior (data) | Real-time | Very high (state of the art, human) | Needs training dataset; 2-view input |
| GHG (2024) | Explicit Gaussians on SMPL | SMPL-X prior (template) | Real-time (after inferring Gaussians) | Very high (state of the art, human) | Requires fitted SMPL; generalizes well |
| 4DGS (Wu’24) | Explicit 4D Gaussians + MLP | None (generic motion) | Real-time (30–80 FPS) | High (≈ dynamic NeRF) | One model for whole sequence; editable |
| 4DGS-1K (2025) | 4DGS + pruning | None | Ultra real-time (1000+ FPS) | High (≈4DGS quality) | Minor quality loss for huge speed gain |
| MEGA (2024) | 4DGS + compression | None | Real-time | High (≈4DGS quality) | ~100× smaller model size |

This table illustrates how 3D Gaussian Splatting and its extensions achieve a remarkable balance: real-time performance with excellent visual quality, often matching or surpassing neural field methods that are much slower. By incorporating domain knowledge (like human body models) or by improving the primitive structure (like BG-Triangles), these methods push the envelope in their respective niches.

Datasets and Resources

Rich datasets are crucial for training and evaluating 3D Gaussian Splatting models. This section covers the key datasets commonly used in this domain.

Synthetic NeRF Dataset (Blender Scenes)

The Synthetic NeRF Dataset consists of synthetic 360° scenes rendered in Blender with known camera parameters:

  • Content: A set of 8 objects (e.g., Lego, Chair, Drums) with realistic materials, rendered from viewpoints primarily on the upper hemisphere, with two full-sphere scenes

  • Scale: Each scene includes approximately 100 rendered images at 800×800 resolution

  • Format: Path-traced images with global illumination and reflections, accompanied by ground-truth camera poses

  • Significance: Creates a controlled environment for novel view synthesis evaluation, with perfect camera calibration and high-quality renders

  • Access: Available through the NeRF project’s Google Drive link, with scripts for downloading included in the official NeRF repository

  • Links: - NeRF Project Page: https://www.matthewtancik.com/nerf - GitHub Repository: https://github.com/bmild/nerf - Original Paper: https://arxiv.org/abs/2003.08934

This dataset is ideal for validating methods in a controlled setting, as it provides clean, noise-free data with ground truth for both geometry and camera parameters.

LLFF (Local Light Field Fusion)

LLFF consists of forward-facing captures of real-world scenes:

  • Content: A set of 24 real-world scenes captured with a handheld smartphone, though a subset of 8 scenes (e.g., Fern, Flower, Fortress) is most commonly used

  • Scale: Each scene contains approximately 20-30 images at resolution ~1008×756

  • Format: Forward-facing captures (looking inward toward the scene, covering roughly one side)

  • Significance: Demonstrates that even with sparse, handheld captures, high-quality novel views can be synthesized

  • Access: The most commonly used 8 scenes are available via the NeRF Google Drive, with camera poses estimated using COLMAP

  • Links: - LLFF Project Page: https://bmild.github.io/llff - GitHub Repository: https://github.com/Fyusion/LLFF - Used in NeRF Paper: https://arxiv.org/pdf/2003.08934

LLFF presents a common benchmark for evaluating view interpolation/extrapolation in a limited forward-facing view band, replicating a realistic capture scenario.

Tanks and Temples

Tanks and Temples is a high-quality benchmark for large-scale scene reconstruction:

  • Content: Diverse real scenes captured in indoor and outdoor settings (e.g., a courtyard, a living room with a tank model, a temple structure)

  • Scale: Split into Training (7 scenes), Intermediate (8 scenes: Family, Francis, Horse, Train, etc.), and Advanced (6 scenes: Auditorium, Ballroom, Temple, etc.) sets

  • Format: Each scene is provided as a high-resolution video (4K) and downsampled frames, with ground-truth 3D geometry from industrial laser scanners

  • Significance: The varying complexity of scenes (from small objects to building-scale environments) and precise laser scans create an excellent challenge to test reconstruction fidelity

  • Access: Available from the official Tanks and Temples website, with a Python downloader for grabbing all videos or images

  • Links: - Tanks and Temples Download Page: https://www.tanksandtemples.org/download/

This dataset is particularly valuable for evaluating how well 3D Gaussian Splatting performs on complex, large-scale real-world environments.

Multi-Object 360 (CO3D - Common Objects in 3D)

CO3D is a large-scale dataset of real-world object-centric captures:

  • Content: 50 common object categories (mostly rigid objects from the COCO taxonomy, like cars, chairs, hydrants)

  • Scale: Approximately 19,000 short video sequences (each focusing on a single object), totaling 1.5 million frames

  • Format: Each sequence includes camera poses (extrinsics and intrinsics), a reconstructed point cloud for the object, and foreground mask annotations

  • Significance: Objects are captured with full 360° coverage whenever possible, enabling category-specific 3D reconstruction and novel view synthesis

  • Access: The full dataset (CO3D-v1) is publicly available, with CO3D-v2 (expanded to roughly twice as many sequences with improved quality) also released

  • Links: - GitHub Repository: https://github.com/facebookresearch/co3d - Original Paper: https://arxiv.org/abs/2109.00512

CO3D’s scale and variety make it ideal for training models that can infer 3D shape and appearance of an object from few views or even a single view.

AMASS (Archive of Motion Capture as Surface Shapes)

AMASS unifies motion capture data into a common representation:

  • Content: Aggregates 15 existing mocap datasets (e.g., CMU Mocap, HumanEva, SFU)

  • Scale: Over 40 hours of motion data (about 11,000 motion sequences from 344 subjects)

  • Format: Each sequence is a time-series of poses in SMPL parameter space (body pose parameters, shape, and translational root motion)

  • Significance: Turns diverse mocap data into a single dataset of parametrized human 3D surface motions, enabling data-driven research on human body dynamics

  • Access: Free for research but requires agreeing to a license; users must register on the AMASS site

  • Links: - AMASS Project Page: https://amass.is.tue.mpg.de/ - MoSh++ GitHub Repository: https://github.com/nghorbani/moshpp

While primarily used for human motion modeling, AMASS is valuable for creating dynamic scenes to test 3D Gaussian Splatting of moving subjects.

CAPE (Clothed Auto Person Encoding)

CAPE provides dynamic clothed human meshes with corresponding body shapes:

  • Content: 15 subjects (10 male, 5 female) each performing a variety of movements while wearing different outfits

  • Scale: 600+ sequences, with over 140,000 frames of 3D scans

  • Format: Each frame is a 3D mesh of a person in a specific pose and clothing configuration, registered to the standard SMPL body mesh (6,890 vertices)

  • Significance: For every frame, CAPE provides both the clothed mesh and the underlying nude body mesh (in the same pose), plus SMPL pose parameters

  • Access: Available upon request via the CAPE site

  • Links: - CAPE Dataset Page: https://cape.is.tue.mpg.de/dataset.html - GitHub Utilities: https://github.com/qianlim/cape_utils

CAPE is particularly useful for learning models of clothing deformation on bodies and for training networks to infer body shape under clothing.

THuman / THuman2.0

THuman provides high-fidelity 3D human scans with realistic textures:

  • Content: THuman v1 (2019) contains about 6,000 scans of human subjects in various poses. THuman2.0 (2020) includes 500 distinct scans of humans with high-resolution geometry

  • Scale: THuman2.0 offers 500 scans with 8K textures and detailed geometry (~300K faces per mesh)

  • Format: Each scan in THuman2.0 is an .obj mesh with a photorealistic texture map and fitted SMPL-X body parameters

  • Significance: Used for digital human modeling, 3D avatar creation, and human reconstruction tasks

  • Access: Available for research upon request through the THuman project page

  • Links: - THuman2.0 GitHub: https://github.com/ytrock/THuman2.0-Dataset - Project Page: https://liuyebin.com/dataset.html

The quality of THuman2.0 (clean topology, detailed textures, varied poses) makes it valuable for creating realistic human renderings and for AR/VR content.

RenderPeople

RenderPeople is a commercial library of photorealistic 3D human models:

  • Content: Over 4,500 scanned people covering various ages, ethnicities, clothing, and activities

  • Scale: Each model includes high-resolution meshes with tens of thousands of polygons and 8K textures

  • Format: Models come in three categories: Posed People (static scans), Rigged People (with skeletal armatures), and Animated People (with pre-made motion capture animations)

  • Significance: The high realism and diversity make these models useful for generating synthetic data for training AI systems

  • Access: Commercial dataset with individual models purchasable from the website; free samples are available for testing

  • Links: - Free 3D People Samples: https://renderpeople.com/free-3d-people/

While commercial, RenderPeople has been used in research for data augmentation and as ground truth for human reconstruction algorithms.

Open-Source Implementations

Several open-source implementations for working with these datasets and 3D Gaussian Splatting include:

  1. The official 3D Gaussian Splatting implementation from the original authors (Kerbl et al.): https://github.com/graphdeco-inria/gaussian-splatting

  2. Community implementations with various extensions and optimizations

  3. Integration with frameworks like PyTorch3D for broader research applications

  4. Dataset loaders and utilities specific to each benchmark

These resources make it easy to get started with 3D Gaussian Splatting for research or applications.

Future Directions

Research in 3D Gaussian Splatting is rapidly evolving. While initial works have demonstrated the core capability of representing scenes with explicit 3D Gaussians at high quality and speed, there remain many open challenges and opportunities. This section outlines several promising future research directions, each addressing current limitations or enabling new applications.

Improved Compression Techniques for Memory Efficiency

As 3D Gaussian splat representations grow in number of primitives (often millions of Gaussians for a scene), memory and storage efficiency becomes a critical issue. Several compression approaches are being explored:

  • Redundancy Reduction: Pruning redundant Gaussians and quantizing parameters – recent work on Temporally Compressed 3D Gaussian Splatting showed that by selectively removing less important splats over time and using mixed-precision encoding, one can achieve up to 67× compression with minimal quality loss.

  • Hierarchical Encoding: Organizing Gaussians into multi-resolution clusters or an octree, so that distant or small-detail splats can be stored at lower precision or omitted until needed.

  • Attribute Compression: Compressing color and opacity attributes (e.g., using PCA or learned codebooks) to drastically cut down memory usage.

The goal is to enable lighter-weight models that can be transmitted and loaded efficiently, which is especially important for mobile or web applications. Moving forward, we expect techniques like on-the-fly streaming of Gaussians, delta encoding between frames (for dynamics), and better quantization of Gaussian parameters to make 3D Gaussian Splatting more memory-friendly.
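
As a concrete (if simplistic) example of the attribute-compression idea listed above, the sketch below uniformly quantizes per-Gaussian attribute vectors such as SH color coefficients to 8-bit codes with a per-dimension scale and offset; learned codebooks or PCA typically compress further.

```python
import numpy as np

def quantize_attributes(attr):
    """Per-dimension uniform 8-bit quantization of Gaussian attributes
    (e.g. SH color coefficients). Returns integer codes plus the scale and
    offset needed to dequantize. A simplistic sketch of attribute compression."""
    lo, hi = attr.min(axis=0), attr.max(axis=0)
    scale = (hi - lo) / 255.0 + 1e-12
    codes = np.round((attr - lo) / scale).astype(np.uint8)
    return codes, scale.astype(np.float32), lo.astype(np.float32)

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

sh = np.random.randn(100_000, 48).astype(np.float32)   # toy SH coefficients
codes, scale, lo = quantize_attributes(sh)
err = np.abs(dequantize(codes, scale, lo) - sh).max()
print(f"{sh.nbytes / 2**20:.1f} MiB -> {codes.nbytes / 2**20:.1f} MiB, "
      f"max abs error {err:.4f}")
```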

Handling Dynamic and Deformable Scenes

While 3D Gaussian Splatting has excelled in static scenes, a frontier is dynamic or deformable scenes – where content moves, articulates, or changes over time. Extending Gaussian splats to the time domain raises questions of how to update the Gaussians’ parameters frame by frame or how to represent spacetime consistently.

Initial research has explored two main paradigms:

  1. Per-frame Gaussians with temporal correspondence: Allowing each Gaussian to move or appear/disappear over time

  2. Spatio-temporal Gaussians: Where each Gaussian represents a trajectory or a 4D volume in space-time

Recent works have begun to tackle these challenges:

  • Methods that introduce sparse control points to drive Gaussian deformation so that a smaller set of key points control clusters of Gaussians, effectively encoding motion fields with fewer parameters

  • Strategies that enforce consistency constraints, keeping Gaussian attributes coherent across frames

  • Combinations of skeletal motion models (like SMPL for human bodies) with Gaussians attached to the moving parts

Future research will likely develop hybrid models where an underlying motion field (possibly learned with an MLP) warps a static Gaussian configuration through time, enabling 4D Gaussian Splatting that can handle complex motions.

Advanced Material Modeling for Realistic Rendering

In its original form, Gaussian splatting uses a simple emissive color model (each Gaussian stores a color and opacity, possibly with view-dependent spherical harmonics for mild view effects). This is efficient, but highly reflective or refractive materials are not handled accurately by this basic model.

Recent research directions include:

  • Advanced Shading Models: Adding shading functions to Gaussians to capture specular highlights and view-dependent reflections

  • Normal Estimation: Estimating a normal vector for each Gaussian (from the covariance shape or adjacent geometry) and then using a microfacet BRDF to modulate the color

  • Neural Reflectance: Using small neural networks to predict reflectance given view angle and a Gaussian’s properties

Another aspect is global illumination – handling shadows or interreflections between Gaussians. Future work might integrate techniques from point-based global illumination, assigning each Gaussian material properties and computing lighting with environment maps or real-time approximations.

A promising route is deferred rendering with Gaussians, where a first pass splats geometry and basic info, and a second pass computes lighting in image space. Adding richer material representation to 3D Gaussian Splatting will push its visual fidelity closer to traditional rendering, allowing scenes with mirrors, shiny cars, or translucent objects to be rendered correctly.

Hybrid Approaches Integrating Neural Fields and Explicit Representations

A compelling future direction is to combine the benefits of explicit Gaussians with implicit neural networks, forming a hybrid representation. Such approaches could:

  • Use a coarse neural field to guide or condition the placement of Gaussians

  • Use Gaussians as a fixed backbone with a small neural network refining details

  • Employ a neural feature field (e.g., a tri-plane or voxel grid) alongside Gaussians to encode high-frequency details

Another approach is progressive implicit-explicit modeling: start with an implicit coarse model of the scene and then spawn Gaussians in areas where fine detail is needed (edges, high texture regions), using the network to fill in what points alone miss.

There is evidence that hybrid models can yield better quality-speed trade-offs – e.g., a recent method combined a fast explicit point model with a small learned component to improve generalization. This direction is also promising for generalization across scenes: a neural field (with learned scene-agnostic features) could help Gaussians represent new scenes without per-scene optimization.

Scalability for Large-Scale Scenes (City-Level and Beyond)

Scaling 3D Gaussian Splatting to very large scenes (e.g., an entire city block or a forest) presents challenges in both representation size and rendering efficiency. Future research is focusing on methods to handle unbounded environments and vast numbers of Gaussians gracefully.

Key approaches include:

  • Level-of-Detail (LOD): A city-level scene can be divided into sectors or an octree grid; distant sectors are represented with fewer/lower-resolution Gaussians, while near sectors use the full detail

  • Streaming: As the camera moves, Gaussians could be streamed in and out of memory for the regions entering or leaving the view

  • Divide-and-Conquer Training: Methods like CityGaussian employ a divide-and-conquer training approach and multi-LOD representation to reconstruct an urban scene

  • Optimized Rendering: Techniques like frustum culling for Gaussians, clustering splats into tiles, and GPU acceleration through compute shaders can keep frame rates high

  • Foveated Rendering: Where the density of splats is higher in the viewer’s focus area and lower in the periphery

Combining 3D Gaussian Splatting with mapping systems (like converting city GIS data to Gaussians, or starting from photogrammetry models and then optimizing Gaussians) could also be explored.

Real-Time Applications in AR/VR and Gaming

One of the most exciting directions is pushing 3D Gaussian Splatting toward real-time performance for interactive applications such as augmented reality, virtual reality, and video games.

This involves optimizing both the rendering pipeline and the scene representation:

  • Hardware Acceleration: Exploiting dedicated GPU features – for example, using point cloud rendering pipelines or even hardware ray tracing cores to accelerate splat rasterization

  • Web and Mobile Implementation: Porting Gaussian Splatting to run in the browser or on mobile GPUs (WebGL/WebGPU implementations)

  • Low-Latency Updates: Finding fast methods to update Gaussians in response to tracking data or user input, potentially by training adaptive networks that refine splats on the fly

In gaming, 3D Gaussian Splatting could be used to quickly capture and insert real-world scenes into game engines, or to render large crowds/vegetation with less performance cost than detailed polygonal models.

Over the next few years, we anticipate research that tightens the feedback loop between splat-based representations and interactive systems, enabling smooth, real-time experiences where users can move through photorealistic environments or interact with neural objects without precomputation.

Integration with Existing Graphics Pipelines

Another area of future research is better integration of Gaussian Splatting with traditional graphics pipelines:

  • Engine Integration: Incorporating Gaussian Splatting into game engines like Unity and Unreal Engine

  • Hybrid Rendering: Combining traditional rasterization of hard surfaces with Gaussian Splatting for complex phenomena like hair, vegetation, or smoke

  • Editing Tools: Developing intuitive interfaces for editing and manipulating Gaussian-based scenes

This integration would allow developers to leverage the benefits of Gaussian Splatting while still using familiar tools and workflows.

Learning from Limited Data

Current 3D Gaussian Splatting methods typically require multiple images of a scene from different viewpoints. Future research could focus on:

  • Few-Shot Learning: Reconstructing a scene from just a few images or even a single image

  • Cross-Scene Priors: Leveraging knowledge from previously seen scenes to better reconstruct new scenes

  • Text-to-3D: Using text descriptions to guide the generation or refinement of Gaussian-based scenes

These advances would make the technology more accessible and usable in scenarios where capturing multiple views is impractical.

In summary, while 3D Gaussian Splatting has already made significant strides in real-time, high-quality novel view synthesis, these future directions promise to expand its capabilities, efficiency, and applicability across numerous domains.

Conclusion

3D Gaussian Splatting represents a paradigm shift in novel view synthesis, successfully bridging the gap between neural rendering quality and traditional graphics efficiency. By leveraging explicit 3D Gaussian primitives that can be efficiently rendered through a specialized pipeline, this approach has fundamentally transformed how we represent and render 3D scenes.

The key contributions of 3D Gaussian Splatting can be summarized as follows:

  • Unprecedented Performance-Quality Balance: Achieving real-time rendering speeds (30-135 FPS at 1080p resolution) while maintaining visual quality comparable to or exceeding state-of-the-art neural radiance fields

  • Accelerated Training: Reducing optimization time from hours or days to just minutes, enabling rapid scene reconstruction and iteration

  • Mathematically Elegant Framework: Providing a unified approach that combines the volumetric integration of NeRF with the efficiency of point-based rendering through a differentiable, analytically sound formulation

  • Explicit Representation: Making scenes directly editable and manipulable, unlike the “black box” nature of neural network approaches

  • Adaptive Detail: Automatically adjusting the density and distribution of primitives to match scene complexity through its adaptive optimization algorithm

The impact of 3D Gaussian Splatting extends far beyond technical improvements. It has profound implications for multiple domains:

In Computer Vision and Graphics Research: Gaussian Splatting has opened new research directions at the intersection of neural and traditional rendering, demonstrating that explicit representations can match or exceed neural networks for certain tasks. This challenges the previous assumption that implicit neural representations were necessary for high-quality novel view synthesis.

In Content Creation and Media: The ability to quickly reconstruct a photorealistic, editable 3D scene from a set of images has transformative potential for film production, game development, and virtual production. Artists and designers can now capture real-world environments and rapidly integrate them into creative workflows.

In Immersive Technologies: For virtual and augmented reality, Gaussian Splatting’s real-time performance enables truly photorealistic virtual environments that can run on consumer hardware, potentially solving the long-standing challenge of achieving both realism and responsiveness in immersive experiences.

In Practical Applications: Fields such as cultural heritage preservation, architectural visualization, e-commerce, and remote collaboration can leverage this technology to create more accurate, accessible, and useful digital replicas of real-world objects and environments.

The development of 3D Gaussian Splatting also illustrates a broader trend in computer graphics and vision: the convergence of neural and traditional approaches. Rather than neural networks replacing traditional graphics techniques, we’re seeing the emergence of hybrid methods that combine the best aspects of both paradigms. Gaussian Splatting exemplifies this fusion, taking inspiration from classic point-based rendering while incorporating modern optimization techniques and the volumetric rendering framework from neural fields.

Looking ahead, as outlined in the Future Directions section, we can expect rapid advancements in addressing current limitations such as memory efficiency, material modeling, and dynamic scene handling. The ongoing research will likely lead to even more powerful representations that can handle increasingly complex scenarios while maintaining real-time performance.

In conclusion, 3D Gaussian Splatting marks a critical milestone in the evolution of 3D scene representation and rendering. By striking an optimal balance between quality, speed, and controllability, it has not only advanced the state of the art in novel view synthesis but has also made photorealistic neural rendering practical for real-world applications. As the technology continues to mature and combine with other approaches, we can expect it to become a fundamental component in the toolkit of computer vision researchers, graphics engineers, and content creators across multiple industries.

Glossary

Alpha Blending

The process of combining a foreground color with a background color based on the foreground’s alpha (opacity) value.

Alpha Compositing

Combining multiple semi-transparent layers to form a final image, typically using the “over” operator.

Anisotropic Gaussian

A Gaussian distribution with different variances along different axes, forming an ellipsoid rather than a sphere.

Covariance Matrix

A symmetric positive semi-definite matrix that defines the shape, size, and orientation of a Gaussian distribution.

Densification

The process of adding new Gaussians to improve scene representation in areas with high reconstruction error.

Differentiable Rendering

A rendering process where gradients of the output image with respect to scene parameters can be computed, enabling optimization.

EWA (Elliptical Weighted Average)

A technique for high-quality filtering when projecting 3D elements to 2D, used in point-based rendering to avoid aliasing.

Implicit Representation

A scene representation that defines geometry or appearance as a continuous function, typically implemented with a neural network.

Jacobian

A matrix of partial derivatives representing the local linear approximation of a function (in this context, the projection from 3D to 2D).

L1 Loss

A loss function that measures the absolute difference between predicted and target values, often used in image comparison.

Light Field

A function that describes the amount of light flowing in every direction through every point in space.

Linear Blend Skinning (LBS)

A technique for deforming a mesh according to an underlying skeleton, where each vertex is influenced by multiple joints.

Multi-View Stereo (MVS)

A technique to reconstruct dense 3D geometry from multiple images with known camera positions.

NeRF (Neural Radiance Field)

A method that uses a neural network to represent a scene as a continuous function mapping 3D coordinates and viewing directions to color and density.

Novel View Synthesis

The task of generating new images of a scene from viewpoints that were not part of the original capture.

Parametric Model

A 3D model controlled by a small set of parameters, such as SMPL for human bodies.

Point Cloud

A set of data points in 3D space, typically representing the external surface of an object.

Pruning

The process of removing Gaussians with negligible contribution to improve efficiency.

Quaternion

A four-dimensional extension of complex numbers, often used to represent 3D rotations without gimbal lock.

Radiance Field

A function that defines the color and density at every point in 3D space, potentially varying with viewing direction.

Rasterization

The process of converting vector graphics (like triangles or points) into a raster image (pixels).

Ray Marching

A technique used in volume rendering where rays are sampled at discrete steps to accumulate color and opacity.

Ray Tracing

A rendering technique that simulates the physical behavior of light by tracing rays from the camera through the scene.

SMPL (Skinned Multi-Person Linear Model)

A parametric model of the human body that can be controlled with a small number of shape and pose parameters.

Spherical Harmonics

A set of functions that form an orthogonal basis for representing functions on a sphere, used to encode view-dependent appearance.

Splatting

A rendering technique where each element (point, Gaussian, etc.) is projected onto the image plane as a small disk or ellipse.

SSIM (Structural Similarity Index Measure)

A perceptual metric that quantifies image quality degradation based on structural information, luminance, and contrast.

Structure-from-Motion (SfM)

A technique to estimate 3D structures and camera motion from a sequence of 2D images.

Tile-Based Rendering

A rendering approach that divides the screen into tiles and processes each tile independently to improve efficiency.

Transmittance

The fraction of light that passes through a medium without being absorbed or scattered, used in volumetric rendering.

Volume Rendering

A technique for rendering a 2D projection of a 3D discretely sampled dataset, typically by ray marching.

Voxel Grid

A 3D grid of volumetric elements (voxels) that discretizes a 3D space.