Lecture 09.1: Neural Implicit and Point-Based Representations for Clothed Human Modeling
Introduction
Modeling clothed human bodies in 3D is a longstanding challenge in computer vision and graphics. Traditional approaches often rely on mesh-based models like SMPL (Skinned Multi-Person Linear model), which represent the human body with a fixed-topology triangulated mesh and blend skinning for animation. While mesh models are efficient and interpretable, they struggle to capture the complex geometry and varying topology of clothing (e.g., loose garments, layers) without significant manual effort or template modifications.
Recent advances have introduced alternative representations – notably neural implicit functions and point-based representations – which promise greater flexibility in representing high-detail, varying-topology surfaces like clothing. This lecture explores these emerging representations for clothed human modeling, including signed distance fields (SDFs), occupancy fields, Neural Radiance Fields (NeRFs) adapted to humans, and point-based methods. We compare these approaches to traditional mesh models in terms of expressiveness, differentiability, data efficiency, and animation performance, examining both their theoretical foundations and practical applications.
Background: Explicit vs. Implicit vs. Point-Based Representations
To understand the landscape of 3D human modeling approaches, we first need to differentiate between the main categories of representations.
Mesh-Based Models
Meshes represent surfaces by a set of vertices connected in a fixed topology. For human bodies, parametric mesh models (e.g., SMPL) provide low-dimensional controls for shape and pose, and clothing can be added via displacements on the base body mesh.
Advantages:
- Intuitive and compatible with existing graphics pipelines
- Explicit surface definition makes collision handling and rendering straightforward
- Efficient skinning methods (e.g., Linear Blend Skinning) enable fast animation
Limitations:
- Fixed genus/topology makes loose garments difficult to model without cutting/stitching mesh parts
- Fixed resolution (number of vertices) limits detail without increasing memory requirements
- Handling complex cloth deformations or topological changes is cumbersome
- Self-intersections become more likely with higher vertex counts
Despite these limitations, mesh models remain widely used due to their compatibility with existing tools and efficient animation properties.
Point-Based Models
Point clouds represent surfaces as an unstructured set of points in 3D (often with normals or colors). They can be seen as an intermediate between explicit and implicit representations.
Advantages:
- Can capture arbitrary topology and fine detail (each point samples the surface geometry)
- No connectivity requirements allow representation of complex clothing geometries (holes, layers)
- Compatible with neural networks (e.g., set or convolutional architectures)
- Points can store local features encoding shape or deformation properties
Limitations:
- Lack of connectivity means surfaces are “hollow” collections of points
- Special techniques (e.g., splatting or meshing) needed for rendering or collision detection
- Resolution is finite, though one can sample many points for high detail
Point-based representations have gained interest as they can achieve high resolution and topological flexibility like implicit functions, but with easier integration into standard graphics pipelines.
Neural Implicit Representations
Implicit representations define 3D surfaces as the level sets of a continuous function (typically parameterized by a neural network) defined over \(\mathbb{R}^3\). Two common implicit functions are signed distance fields (SDFs) and occupancy fields.
Advantages:
- Represent arbitrary topologies since level sets can split or merge as needed
- Continuous representation not tied to a particular resolution
- Differentiable with respect to geometry, enabling optimization
- Can capture fine details with sufficient network capacity
Limitations:
- Computationally heavy to train and evaluate
- Extracting a mesh requires expensive operations (e.g., Marching Cubes)
- Collision detection or physical simulation is non-trivial without conversion to explicit form
Neural implicit models represent surfaces implicitly as the zero-level set of a neural function \(f_\theta(\mathbf{x})\). For example, an occupancy network outputs a probability of point \(\mathbf{x}\) being inside the object; the surface is the decision boundary. An SDF network outputs the signed distance to the surface, with the surface at \(f_\theta(\mathbf{x})=0\).
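To make this concrete, below is a minimal sketch (not taken from any particular paper) of an MLP-based SDF in PyTorch; the class name SDFNet, the layer sizes, and the Softplus activation are illustrative assumptions. The surface is the set of points where the network output crosses zero.

import torch
import torch.nn as nn

class SDFNet(nn.Module):
    """Toy signed distance network: f_theta(x) -> scalar distance."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                    # x: (N, 3) query points
        return self.net(x).squeeze(-1)       # (N,) signed distances

f_theta = SDFNet()
x = torch.randn(1024, 3)                     # random query points
d = f_theta(x)                               # predicted signed distances
inside = d < 0                               # negative = inside, by the usual convention
near_surface = d.abs() < 1e-3                # points close to the zero-level set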
Neural Radiance Fields (NeRFs) for Humans
NeRF is a special implicit volumetric representation originally developed for novel view synthesis – a network \(F_\theta(\mathbf{x},\mathbf{d})\) outputs the volume density at any 3D point \(\mathbf{x}\) and the color seen from viewing direction \(\mathbf{d}\), and rendering is performed by volumetric integration along camera rays.
For human modeling, NeRFs have been extended to handle dynamic articulation by conditioning on pose or learning deformation fields. While NeRFs primarily target appearance, the learned density essentially represents the shape implicitly. NeRF-based models can capture realistic details of clothing (texture, fine wrinkles) and allow novel view rendering. However, extracting geometry from a NeRF is an extra step (e.g., by thresholding density).
This topic is covered in more detail in the next section, as well as in the supplementary material “Extended Materials: Neural Radiance Fields”.
Hybrid Approaches
Some state-of-the-art methods combine representations. For example, SCARF (SIGGRAPH Asia 2022) is a hybrid of an explicit body mesh (for the human body) and a neural radiance field for the clothing layer. By integrating a deformable mesh for the body into the NeRF optimization, SCARF achieves a representation where the body pose and shape are controlled by the mesh, and the clothing (which can have complex geometry) is learned implicitly.
Such hybrid models can be optimized from monocular video directly (using differentiable rendering) and can even transfer clothing between subjects. This illustrates that the lines between representation types are often blurred in practice – e.g., one can use an implicit model for cloth on top of an explicit body.
Expressiveness and Topology
A key criterion is how well the representation captures the geometry complexity of clothed people. Meshes with fixed topology struggle to represent clothes that are not snug to the body – e.g., pants and shirts require different mesh connectivities. Traditional approaches created separate garment templates (e.g., a shirt template, a pants template) and learned deformation on each, making it hard to handle new garment types or multiple layers simultaneously.
Neural implicit functions, by contrast, do not require a template mesh and can seamlessly represent different topologies within one model. The SMPLicit model by Corona et al. demonstrates this: it jointly represents a wide variety of garment types (T-shirts, hoodies, jackets, skirts) with a single implicit function, whereas earlier methods needed one model per garment type.
Point-based models can also capture varying topology in a single representation because points are not connected – POP, for example, is a “cross-outfit” model learned from many outfits and can animate an outfit of arbitrary style after fitting its point cloud.
Thus, in terms of expressiveness and topology: implicit and point-based representations offer greater flexibility than single-template meshes, enabling one model to cover diverse clothing geometries.
Differentiability and Learning
Mesh models are explicit and often low-dimensional (e.g., SMPL has 10 shape parameters and 69 body-pose parameters, plus global orientation), so they are very data-efficient for minimally-clothed bodies and easy to fit with optimization. However, to capture clothing, mesh approaches either treat cloth as separate high-dimensional geometry (hard to learn without templates) or as displacements on the body (limited to tight clothing).
Neural implicit models, being high-dimensional function representations, typically require more data (e.g., many 3D scans or images) to train. But once trained, they are fully differentiable with respect to shape/pose parameters or even the input geometry itself. For instance, SMPLicit learns a latent space for clothing shape that is semantically interpretable (controlling garment size, length, etc.), and the whole pipeline is differentiable, allowing fitting to scans or images via gradient descent.
Point-based models like POP also require training on many scans to learn general clothing deformation behavior. A big advantage of neural representations is that they can generalize or interpolate; e.g., Neural-GIF and SNARF can generalize to novel poses not seen in training, something a simple blendshapes rig might not handle without explicit pose corrective data.
In terms of differentiability:
- The function \(f_\theta(\mathbf{x})\) in implicit models is differentiable w.r.t. \(\mathbf{x}\) and \(\theta\), enabling gradient-based algorithms for inverse problems
- Meshes have discrete vertices, so gradients exist w.r.t. vertex positions but the topology is not editable
- Point clouds are somewhat differentiable – one can move points by gradients – but lack a continuous surface definition
Many modern methods combine neural implicits with gradient-based fitting. For example, when SMPLicit is fit to a new scan, the algorithm optimizes the clothing latent code and the implicit surface such that the unsigned distance field matches the scan points. This is possible because the model is end-to-end differentiable. In contrast, fitting a cloth template mesh might require non-differentiable operations like re-meshing if topology differs.
Data Efficiency and Performance
Mesh parametric models are extremely data-efficient for the shapes they represent (e.g., SMPL was learned from ~1700 body scans but generalizes well to new people’s body shapes within that distribution). However, to extend them to clothing (which has high variability), collecting a similar parametric basis is difficult.
Neural implicit models can absorb much more data. Many recent works train on tens of thousands of scans or synthetic data (e.g., SMPLicit leverages the CAPE dataset and others). Training these models is computationally intensive (often requiring days on GPUs). Additionally, querying an implicit surface can also be slower than a mesh: e.g., to render or simulate, one might need to evaluate the network many times.
Point-based models like POP can be faster at inference: since the surface is represented by a fixed set of points, rendering as surfels or point clouds is relatively quick, and animation involves moving points which can be done in parallel.
For animation and dynamic performance:
- Meshes with LBS are extremely fast to pose – just apply bone transforms to each vertex (linear complexity in the number of vertices)
- Neural implicit avatars require either solving for correspondences (SNARF’s root-finding for each query point) or warping a grid (Neural-GIF’s deformation field), which is slower
- Point-based models can also be animated efficiently if each point stores skinning weights associated with the body
Some methods precompute structured latent grids or use spatial data structures to speed up querying implicit surfaces (e.g., octrees or multi-resolution feature grids in convolutional methods).
In summary, neural implicit models offer unparalleled flexibility and detail at the cost of training complexity and runtime, whereas point-based models offer a middle ground with high flexibility and easier integration, and mesh models are the fastest and most convenient for certain tasks but lack the capacity to represent complex clothing geometry.
Neural Implicit Function Foundations
Neural implicit functions represent surfaces by a continuous field \(f(\mathbf{x})\) defined for any 3D point \(\mathbf{x}=(x,y,z)\in\mathbb{R}^3\). The surface is a particular level set of \(f\). We focus on three types of fields:
Signed Distance Field (SDF)
\(f(\mathbf{x})\) outputs the signed distance to the surface. By convention:
- \(f(\mathbf{x})<0\) if \(\mathbf{x}\) is inside the object
- \(f(\mathbf{x})>0\) if outside
- \(f(\mathbf{x})=0\) exactly on the surface
Thus \(S = \{\mathbf{x} \mid f(\mathbf{x})=0\}\). The classic SDF has the property \(|\nabla f|=1\) almost everywhere (the gradient’s magnitude equals 1 in continuous space), which is the Eikonal equation satisfied by distance functions.
DeepSDF (Park et al., 2019) introduced using a neural network to represent such a continuous SDF for a class of shapes. In DeepSDF, a decoder network \(f_\theta(\mathbf{x}, \mathbf{z})\) takes spatial point \(\mathbf{x}\) and a latent code \(\mathbf{z}\) (representing a specific shape instance) and outputs an SDF value. This enables learning a family of shapes. In a human context, \(\mathbf{z}\) could represent different identities or clothing styles.
The surface can be extracted via root-finding (find \(\mathbf{x}\) such that \(f_\theta(\mathbf{x},\mathbf{z})=0\)) or by sampling a grid and applying Marching Cubes. Importantly, SDFs allow easy normal computation: at any surface point, the gradient \(\nabla_\mathbf{x} f\) is the outward normal.
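As a small illustration of both properties, the sketch below (my own, reusing the toy SDFNet from the earlier example) computes normals as the normalized gradient of the network output via automatic differentiation, together with the Eikonal residual \((|\nabla f|-1)^2\) that many SDF-learning methods use as a regularizer.

import torch
import torch.nn.functional as F

def normals_and_eikonal(sdf_net, x):
    # x: (N, 3) query points; sdf_net: any differentiable module mapping (N, 3) -> (N,)
    x = x.clone().requires_grad_(True)
    d = sdf_net(x)
    grad = torch.autograd.grad(d.sum(), x, create_graph=True)[0]   # (N, 3) gradient of the field
    normals = F.normalize(grad, dim=-1)                            # unit normals (meaningful at surface points)
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()              # penalizes deviation from |grad f| = 1
    return normals, eikonal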
Some methods use unsigned distance fields (UDFs) instead, where \(u(\mathbf{x}) \ge 0\) is the distance to the surface without an inside/outside distinction. UDFs are helpful when the concept of “inside” vs “outside” is ambiguous (e.g., clothing layers).
Occupancy Field (Indicator Function)
\(f(\mathbf{x})\) outputs \(1\) if \(\mathbf{x}\) is inside the volume of the object, and \(0\) if outside (for a binary occupancy field). In practice, most methods output a continuous occupancy probability in \([0,1]\) (or an unnormalized logit that is passed through a sigmoid). The surface is implicitly where the occupancy probability drops from inside to outside.
Occupancy Networks (Mescheder et al., 2019) popularized learning a neural implicit surface this way. They treat the network as a classifier: \(f_\theta(\mathbf{x})\) = probability \(\mathbf{x}\) is inside the object. During training, they sample points from mesh data and label them 1 (inside) or 0 (outside) and use a binary cross-entropy loss.
One big advantage is that you don’t need ground-truth distance values (just an inside/outside label, which is easier to obtain from a mesh or scan). However, occupancy fields do not directly provide distances, and normals must be approximated, e.g., by the gradient of the logit field.
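A minimal training-step sketch in the spirit of Occupancy Networks follows; the network size, sampling strategy, and variable names are illustrative assumptions, and the inside/outside labels are assumed to come from a watertight training mesh.

import torch
import torch.nn as nn

occ_net = nn.Sequential(                       # toy occupancy network: point -> inside logit
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
optimizer = torch.optim.Adam(occ_net.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(points, inside_labels):
    # points: (N, 3) samples around the training mesh; inside_labels: (N,) with 1 = inside, 0 = outside
    logits = occ_net(points).squeeze(-1)
    loss = bce(logits, inside_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()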
Volume Radiance Field
Although NeRFs primarily target novel view synthesis, from a modeling perspective a NeRF learns a function \(F(\mathbf{x},\mathbf{d}) = (\sigma, C)\) giving a volume density \(\sigma\) (how much volume/matter is at point \(\mathbf{x}\)) and color \(C\) for a ray hitting \(\mathbf{x}\) from direction \(\mathbf{d}\).
The object’s shape is represented by the density field \(\sigma(\mathbf{x})\): surfaces manifest as regions where \(\sigma\) changes (ideally, a surface would be an infinitesimally thin shell of high density). Rendering is done by integrating along camera rays \(\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}\) using the optical model:

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt, \]

where \(T(t)=\exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right)\) is the transmittance.
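In practice the integral is approximated by quadrature over discrete samples along each ray (standard alpha compositing). Below is a minimal per-ray sketch, with the sample densities, colors, and spacings assumed to be given:

import torch

def render_ray(sigmas, colors, deltas):
    # sigmas: (S,) densities at S samples along the ray; colors: (S, 3); deltas: (S,) sample spacings
    alphas = 1.0 - torch.exp(-sigmas * deltas)                 # opacity contributed by each segment
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)         # product of (1 - alpha) up to each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])             # T_i: transmittance before sample i
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)         # composited RGB for this ray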
For human modeling, one approach is to incorporate pose into \(\sigma\)’s definition. A common technique (as in HumanNeRF and related works) is to define the density in a canonical pose, and learn a deformation field that maps each query point in deformed (posed) space back to the canonical space before evaluating \(F\).
The result is the ability to generate novel views for any frame (pose) of the video. Geometrically, a NeRF can be seen as an implicit surface at any desired threshold of density (some methods extract meshes from NeRFs by picking a density threshold and running Marching Cubes on the learned \(\sigma(\mathbf{x})\)).
Articulated Deformation Fields for Implicit Models
A central challenge in applying implicit representations to humans is handling articulation – the human body moves with many degrees of freedom, and clothing deforms with those movements. In a mesh model like SMPL, articulation is handled by Linear Blend Skinning (LBS): each vertex is attached to the skeleton with precomputed skinning weights, and given a pose (joint angles), one computes the transformation of each bone and blends them to find the new vertex position.
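For reference, here is a compact sketch of standard LBS; the weight matrix and per-bone transforms are assumed to be given (e.g., from SMPL):

import numpy as np

def linear_blend_skinning(verts, weights, bone_transforms):
    # verts: (V, 3) rest-pose vertices; weights: (V, B) skinning weights, rows summing to 1
    # bone_transforms: (B, 4, 4) rigid transform of each bone for the target pose
    verts_h = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)    # homogeneous coordinates (V, 4)
    per_bone = np.einsum('bij,vj->vbi', bone_transforms, verts_h)          # each vertex moved by each bone (V, B, 4)
    blended = np.einsum('vb,vbi->vi', weights, per_bone)                   # weighted blend across bones (V, 4)
    return blended[:, :3]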
Two general strategies have emerged for implicit models:
Backward Warping (Inverse Skinning Field)
This approach defines a function \(W^{-1}_\theta: \text{deformed space} \to \text{canonical space}\) that maps any point \(\mathbf{p}'\) in the posed space back to a point \(\mathbf{p}\) in the canonical space (e.g., T-pose) given pose \(\theta\).
To query if \(\mathbf{p}'\) is on the surface in pose \(\theta\), we map it back to canonical space, get \(\mathbf{p}=W^{-1}_\theta(\mathbf{p}')\), and then evaluate the implicit function \(f_\text{canonical}(\mathbf{p})\). The surface in posed space is then

\[ S_\theta = \{\, \mathbf{p}' \mid f_\text{canonical}\big(W^{-1}_\theta(\mathbf{p}')\big) = 0 \,\}. \]
Many early works (like NASA or the first implicit clothed models) opted to directly learn \(W^{-1}_\theta\). That is, they train a network that given \((\mathbf{p}', \theta)\) predicts \(\mathbf{p} = W^{-1}_\theta(\mathbf{p}')\). This network effectively learns the occupancy or SDF field in canonical space and how that field moves with pose.
For example, Neural-GIF uses an inverse map: it maps every query point in posed space to canonical, applies a learned deformation (for non-rigid cloth deformations), then evaluates the SDF. In Neural-GIF, the mapping is factorized: first an inverse LBS using predicted skinning weights to get an initial \(\mathbf{p}\), then a learned displacement \(\delta \mathbf{p}(\mathbf{p}, \theta)\) in canonical space to account for pose-dependent offsets (like cloth wrinkles or body bulging).
Backward warping is intuitive and directly aligns with data: given a point on a posed scan, you can try to map it to canonical and enforce the occupancy to be inside. Its weakness is that \(W^{-1}_\theta\) itself is pose-dependent (a different function for each pose), making it hard to generalize to poses not in the training set.
Forward Warping (Forward Skinning Field)
Instead of mapping \(\mathbf{p}' \to \mathbf{p}\), we define a field in canonical space that tells us where each canonical point goes when posed. Formally, \(W_\theta: \text{canonical space} \to \text{deformed space}\), e.g., given a canonical point \(\mathbf{p}\) (in T-pose coordinates), \(W_\theta(\mathbf{p}) = \mathbf{p}'\) is its location under pose \(\theta\).
With a forward map, the surface in posed space can be obtained by warping the entire canonical surface:

\[ S_\theta = \{\, W_\theta(\mathbf{p}) \mid f_\text{canonical}(\mathbf{p}) = 0 \,\}. \]
But to query whether a given \(\mathbf{p}'\) is on the surface, we need to find if there exists some \(\mathbf{p}\) such that \(\mathbf{p}' = W_\theta(\mathbf{p})\) and \(f_\text{canonical}(\mathbf{p})=0\). That requires solving \(\mathbf{p} = W_\theta^{-1}(\mathbf{p}')\) anyway.
The key is to make \(W_\theta\) simpler (pose-independent) so that its inverse can be solved numerically in a stable way. SNARF (Chen et al., ICCV 2021) is the landmark method that does this. SNARF defines a forward skinning field in canonical space: basically, a set of skinning weights \(w_b(\mathbf{p})\) for each canonical coordinate \(\mathbf{p}\), similar to SMPL’s weights but continuous. These weights are pose-independent (a function of \(\mathbf{p}\) only).
Given a pose (bone transforms \(G_b(\theta)\)), the forward warp is:

\[ W_\theta(\mathbf{p}) = \sum_{b=1}^{B} w_b(\mathbf{p})\, G_b(\theta)\, \mathbf{p}, \]

where \(\mathbf{p}\) is treated in homogeneous coordinates.
This looks like LBS, but \(w_b\) are given by a neural network (conditioned on \(\mathbf{p}\) coordinates) instead of pre-defined by a template. The benefit is that \(w_b(\mathbf{p})\) does not change with pose, so the network just learns one skinning weight field for the whole space. This field can generalize to new poses because it doesn’t directly encode pose; pose only comes into play in the known linear blend formula.
The challenge is that to train it, one must solve the inverse problem: given a point \(\mathbf{p}'\) in posed space (from a training mesh), find all \(\mathbf{p}\) in canonical such that \(W_\theta(\mathbf{p})=\mathbf{p}'\). SNARF tackles this by iterative root finding with implicit differentiation to propagate gradients.
Essentially, for each query point \(\mathbf{p}'\), they run a small iterative solver (Broyden’s method in SNARF, initialized from each bone’s rigid transformation) to find the canonical points \(\mathbf{p}\) that could produce it.
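The toy sketch below illustrates the idea of forward skinning plus correspondence search using a plain gradient-based residual minimization; this is a simplification of SNARF’s actual solver (which uses Broyden’s method with implicit differentiation), and skin_weights_net and bone_transforms are assumed to be given.

import torch

def forward_warp(p, skin_weights_net, bone_transforms):
    # p: (3,) canonical point; bone_transforms: (B, 4, 4) posed bone transforms
    w = torch.softmax(skin_weights_net(p), dim=-1)                # (B,) canonical skinning weights w_b(p)
    p_h = torch.cat([p, torch.ones(1)])                           # homogeneous coordinates
    posed = torch.einsum('b,bij,j->i', w, bone_transforms, p_h)   # sum_b w_b(p) * G_b * p
    return posed[:3]

def find_canonical(x_posed, skin_weights_net, bone_transforms, init, iters=50, step=0.5):
    # Search for a canonical p whose forward warp matches the posed query x_posed,
    # starting from `init` (e.g., x_posed rigidly transformed back by one bone).
    p = init.clone().requires_grad_(True)
    for _ in range(iters):
        residual = forward_warp(p, skin_weights_net, bone_transforms) - x_posed
        loss = (residual ** 2).sum()
        (grad,) = torch.autograd.grad(loss, p)
        p = (p - step * grad).detach().requires_grad_(True)       # simple gradient step on the residual
    return p.detach()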
Pose-Dependent Deformation (Secondary Motion)
Whether using forward or backward skinning for the bulk articulation, real clothing also has non-rigid deformations dependent on pose (think of how a skirt flares when legs move, or wrinkles form at the elbow when arm bends). To model this, most frameworks include an additional deformation field or correctives.
In SMPL (meshes) this is handled by pose-blendshapes – vertex offsets as a function of joint angles. In implicit models, we incorporate a learned function \(\delta \mathbf{p}(\mathbf{p}, \theta)\) that displaces points in canonical space before evaluating the SDF/occupancy.
Neural-GIF explicitly does this: after inverse warping a point to canonical, they add a learned offset \(\delta \mathbf{p}\) (the “non-rigid deformation”) then evaluate the signed distance. This effectively allows the canonical shape to morph depending on pose, producing folds or muscle bulges.
Similarly, SNARF conditions its implicit shape on pose in a way that captures local changes (they achieve this by inputting local pose features to the occupancy MLP so the implicit surface itself is pose-conditioned, even though the skinning weights are static).
SCANimate introduced locally pose-aware implicit functions: instead of giving the network a global pose code, they feed per-bone pose features to different parts of the space, which reduced spurious correlations and improved generalization.
Generative Implicit Models for Clothed Bodies
Early neural implicit works on humans focused on static shape representation – learning a space of plausible human shapes (with or without clothing) that can be sampled or fit to data. These models are generative in nature, capturing the distribution of human geometry. They often build upon or replace parametric models like SMPL.
SMPLicit: Topology-Aware Clothed Human Model
SMPLicit (Corona et al., CVPR 2021) is a generative model that combines the SMPL parametric body with a neural implicit surface to represent clothing of various types. The motivation is to have a single learned model that can dress the SMPL body in anything from tight T-shirts to flowing skirts or coats, without needing separate templates for each garment topology. It is “topology-aware” in that it can handle different clothing topologies (even multiple layers) in one framework.
Representation: SMPLicit uses an unsigned distance field (UDF) to represent the clothed surface. The implicit function \(C_\Theta(\mathbf{p})\) takes as input a 3D point \(\mathbf{p}\) and outputs the distance to the clothing surface. A distance of zero means \(\mathbf{p}\) lies on the garment surface.
The UDF is defined in the local coordinates of the SMPL body – specifically, they augment \(\mathbf{p}\)’s representation with information about the underlying body. During training, \(\mathbf{p}\) is expressed relative to the nearest point on the SMPL body or in the local coordinate of a body part.
Additionally, SMPLicit has a latent code that represents the clothing shape/style. This latent is divided into “cut” (geometry) and “style” components in the architecture. The cut latent encodes which garment and its general shape (e.g., long-sleeve vs short-sleeve), and style latent encodes finer details or looseness.
Architecture: The SMPLicit network is a feed-forward decoder that outputs the UDF value. It is trained in two modes simultaneously:
1. Given a ground-truth garment mesh on a SMPL body, they render a form of “occlusion map” and encode it to predict an initial latent (“cut”).
2. They also use an auto-decoder that directly optimizes a latent code per training example (this latent captures additional style aspects).
So the final latent fed to the network is a combination of the predicted cut-code from the occlusion map encoder and the learned style code.
Training data: SMPLicit was trained on a large set of synthetic and real scans of people in clothing (T-shirts, jackets, pants, etc., including multiple layers). They likely used datasets like CAPE (which provides registered clothed meshes) and others, possibly augmented with synthetic garments.
For each training example, they have the SMPL parameters (body shape and pose) and a clothed surface mesh. They sample points around the surface (both near-surface and in free space). The ground truth for a sample point can be the unsigned distance to the mesh. They train \(C_\Theta\) to minimize the error in predicted distance.
Skinning and Output: Once trained, SMPLicit can generate a clothed shape by querying \(C_\Theta\) on a 3D grid around the body and extracting the surface via Marching Cubes. This gives a garment mesh in canonical pose.
To animate the output garment on the SMPL skeleton, they do a simple post-process: for each vertex of the extracted garment mesh, find the nearest SMPL body vertex and copy its skinning weights. Then use SMPL’s LBS to skin the garment. This means SMPLicit’s garments will follow the body’s movements without secondary motion (like a parented object).
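A minimal sketch of that post-process (nearest-vertex skinning-weight transfer; a brute-force nearest-neighbor search is used here purely for clarity):

import numpy as np

def transfer_skinning_weights(garment_verts, body_verts, body_weights):
    # garment_verts: (G, 3) extracted garment vertices; body_verts: (V, 3) SMPL vertices
    # body_weights: (V, B) SMPL skinning weights
    d2 = ((garment_verts[:, None, :] - body_verts[None, :, :]) ** 2).sum(-1)   # (G, V) squared distances
    nearest = d2.argmin(axis=1)                                                # closest body vertex per garment vertex
    return body_weights[nearest]                                               # (G, B) copied weights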
Uses and performance: SMPLicit is fully differentiable, which is a major advantage. The authors show two applications:
1. Fitting to 3D scans (take a raw scan of a person in clothes, optimize the latent code and possibly body shape so that SMPLicit’s surface matches the scan)
2. 3D reconstruction from images (starting from an image, use the encoder to get a cut latent from silhouettes and optimize further to match the image evidence)
In both cases, because the model is generative, it can produce plausible clothes even in unseen configurations. It also allowed interactive editing: since the latent has semantic directions (tightening a garment corresponds to moving latent in some direction), one can change the garment size or length in the latent space and regenerate the mesh.
imGHUM: Implicit Generative Human Model
Around the same time, imGHUM (Implicit Generative HUMan model) was proposed by researchers at Google (Alldieck et al., ICCV 2021). While SMPLicit focuses on clothed bodies, imGHUM aims to be a holistic implicit model of the human body (minimally clothed, with separate implicit head and hand models).
imGHUM represents the entire body (including detailed face and fingers) as a signed distance function. It is “holistic” in that it doesn’t treat clothing separately from body; it’s more akin to a replacement for SMPL that is continuous and high-resolution. imGHUM introduced a continuous SDF model with articulated joints that can be sampled to extremely high detail and is fully differentiable. They also built a generative prior (allowing sampling new identities) using a normalizing flow on the latent space.
For our cloth-focused discussion, imGHUM is mainly notable as an implicit body model; it doesn’t explicitly handle arbitrary outfits (it was trained on scans of minimally-clothed subjects, plus some with tight clothing).
Key difference from SMPLicit: SMPLicit keeps SMPL in the loop (for pose, shape, skinning after), and focuses on clothing geometry. imGHUM tries to model the human (body shape and pose) entirely in the implicit function, relying less on an external parametric model (though they still condition on pose like joint angles).
Pose-Dependent Implicit Models and Animatable Avatars
Animating a clothed human implicit model means making it respond correctly to new pose inputs. Several influential methods have made progress on this front.
NASA: Neural Articulated Shape Approximation (ECCV 2020)
NASA is one of the earliest methods to bring neural implicits to articulated objects (especially humans). It uses a part-based occupancy representation. The idea is to approximate a complex articulated shape by a union of simple implicit parts attached to a skeleton.
For each bone of the skeleton, NASA learns a local implicit shape (an occupancy field) in that bone’s local coordinate system. At pose time, it transforms sample points into each part’s coordinate frame (using the known skeletal transforms) and queries the occupancy. If a point is inside any part’s implicit shape, it’s considered inside the whole shape.
NASA’s network outputs an occupancy probability for each part given a point and the pose; effectively a set \(\{o_b(\mathbf{x})\}\) for \(b=1...B\) parts. These are combined (via max or sum) to yield the final occupancy.
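A sketch of this part-wise query under stated assumptions (one small occupancy network per bone and known inverse bone transforms; NASA additionally conditions parts on pose, which is omitted here for brevity):

import torch

def part_based_occupancy(x_posed, part_nets, inv_bone_transforms):
    # x_posed: (N, 3) query points in posed space
    # part_nets: list of B per-bone occupancy MLPs, each mapping (N, 3) -> (N, 1) logits
    # inv_bone_transforms: (B, 4, 4) transforms mapping posed space into each part's local frame
    x_h = torch.cat([x_posed, torch.ones(len(x_posed), 1)], dim=1)             # (N, 4) homogeneous
    part_occ = []
    for net, T_inv in zip(part_nets, inv_bone_transforms):
        x_local = (x_h @ T_inv.T)[:, :3]                                       # query in this part's frame
        part_occ.append(torch.sigmoid(net(x_local)).squeeze(-1))               # (N,) per-part occupancy
    return torch.stack(part_occ, dim=0).max(dim=0).values                      # union of parts via max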
NASA was trained on minimally clothed human scans with known SMPL fits, so they could get ground-truth part labels (each vertex belongs to a SMPL part). A limitation is that clothing which bridges parts (like a skirt spanning both legs) cannot be represented unless assigned to a single part or split.
NASA’s significance is showing that differentiable occupancy as a function of pose is possible by conditioning on joint angles. It uses a relatively small amount of data because each part’s shape is simpler. However, it inherits the drawback of backward mapping: each part’s occupancy is defined in that part’s canonical space, but the blend at inference time is somewhat heuristic.
SCANimate: Weakly-Supervised Skinned Avatar Networks (CVPR 2021)
SCANimate (Saito et al., 2021) is a milestone because it showed that one can learn an animatable clothed human directly from raw 3D scans without explicit correspondences or a template mesh. It is “weakly-supervised” in that it doesn’t require ground-truth canonical pose alignments of the scans or part labels – it only needs the scans and the ability to fit a parametric body to each scan (which gives a rough pose).
SCANimate’s pipeline is as follows:
They have a set of raw scans of a person in various poses; each scan is just a point cloud or mesh with no point-to-point correspondence across poses. They first fit a SMPL body to each scan (aligning pose, and possibly shape). This gives them a common skeleton and a rough alignment (but the scan’s surface isn’t registered to a template).
They leverage two key observations:
- Fitting SMPL to a clothed scan is tractable (the body can be aligned under the clothes), but doing a full surface registration of a template to the clothed scan is hard (clothing breaks correspondence).
- Articulated transformations are invertible – meaning if you have an implicit model, you can move from posed to canonical and back, allowing a cycle consistency.
SCANimate essentially learns a backward warp (inverse skinning) to canonical, similar to Neural-GIF’s idea, but they enforce cycle consistency: if you warp a scan to canonical and then forward warp it back, you should recover the original scan. They do this by training a neural network that extends LBS to 3D space. This network learns the inverse warp \(W^{-1}_\theta\) such that when applied to all points of a posed scan, the points come into a coherent canonical pose alignment.
They introduce a locally pose-aware implicit function for the geometry. Instead of a single global latent or global pose vector, they condition the implicit surface on pose in a local manner. Concretely, for each query point, they incorporate features like which bones influence it and how. This reduces artifacts where an arm’s pose might erroneously affect a far part of the clothing.
They supervise the training weakly: use the scans (point clouds) and try to reconstruct them in canonical space and in posed space. Missing regions in scans are completed by the network. Through training, the network learns a canonical shape (with the clothing) and the deformation field to map it to any pose, without ever having a template registration.
The result is an animatable implicit avatar that can be driven by pose parameters to produce new poses with realistic clothing deformation. SCANimate also can optionally model appearance (they mention it can be extended to textured avatar), but geometry is the main focus.
One notable thing: SCANimate still ultimately uses a SMPL model as a helper. It uses SMPL’s pose to condition the network and presumably uses SMPL’s skinning as an initial guess for warping. But it does not require the clothing topology to be pre-defined. It even allows different amounts of training data – they showed it works with as few as ~10 scans or as many as 100+, and more data yields better generality.
Neural-GIF: Neural Generalized Implicit Functions (ICCV 2021)
Neural-GIF (Tiwari et al., 2021) builds heavily on the idea of separating pose-dependent deformations from the base shape in an implicit framework. It is conceptually similar to SCANimate but with a more explicit formulation of the deformation field and without requiring scan alignment via cycle consistency. The method was described earlier in our discussion of backward vs forward warp: it maps each query point to canonical space, applies a learned deformation, then evaluates an SDF.
Neural-GIF assumes a set of scans of a single subject in many poses (real scans or even synthetic data), similar to SCANimate; extensions may also support multiple subjects.
They factorize motion into articulation vs. non-rigid deformation (a minimal sketch of the resulting query pipeline follows the list below):
- They use the SMPL model’s known skinning to handle large motions, and learn a non-rigid deformation field for pose-dependent effects.
- Specifically, given a point \(\mathbf{p}'\) in posed space and the pose \(\theta\), they first find the corresponding canonical point via inverse LBS (they do this by predicting skinning weights \(w\) for \(\mathbf{p}'\) and then applying \(W^{-1}(\mathbf{p}',w,\theta)\)). This gives an initial \(\mathbf{p}\) in canonical space.
- Then they evaluate a displacement field \(\delta \mathbf{p} = \delta(\mathbf{p},\theta)\) which tells how \(\mathbf{p}\) should move in canonical space due to the pose.
- They add this, obtaining a deformed canonical point \(\mathbf{p}^* = \mathbf{p} + \delta(\mathbf{p},\theta)\).
- Finally, they evaluate the signed distance field \(f(\mathbf{p}^*)\). If \(f(\mathbf{p}^*) \le 0\) (inside or on surface), then the original point \(\mathbf{p}'\) is inside/on the posed surface.
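Here is the promised sketch of a single-point query under this factorization; the three networks (weight_net, disp_net, sdf_net) and the per-bone transforms are assumed to be given, and the exact conditioning differs from Neural-GIF’s actual implementation.

import torch

def query_posed_point(x_posed, pose, weight_net, disp_net, sdf_net, bone_transforms):
    # x_posed: (3,) posed-space query; pose: (P,) pose vector; bone_transforms: (B, 4, 4)
    # 1) Predict backward skinning weights for the query and invert the blended transform (inverse LBS)
    w = torch.softmax(weight_net(torch.cat([x_posed, pose])), dim=-1)          # (B,)
    blended = torch.einsum('b,bij->ij', w, bone_transforms)                    # (4, 4) blended bone transform
    x_h = torch.cat([x_posed, torch.ones(1)])
    p_canonical = (torch.linalg.inv(blended) @ x_h)[:3]
    # 2) Pose-dependent non-rigid displacement in canonical space
    p_star = p_canonical + disp_net(torch.cat([p_canonical, pose]))
    # 3) Evaluate the canonical SDF; a value <= 0 means the query is inside or on the posed surface
    return sdf_net(p_star)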
The networks involved:
- One network predicts skinning weights \(w(\mathbf{p}',\theta)\) for the query (this is effectively a backward skinning field, pose-conditioned).
- Another network represents the displacement field \(\delta(\mathbf{p},\theta)\).
- And another represents the canonical SDF.
They likely train them jointly. The loss comes from known ground truth occupancies of posed scans: they sample points and know whether they are inside or outside the clothed surface in posed state (from scan data).
Neural-GIF results show realistic pose-dependent wrinkles and dynamics for clothing. One benefit of their approach is not needing a pre-registered template; they learn from raw scans (like SCANimate).
In essence, Neural-GIF formalizes the canonical shape + learned pose deformation concept in a neural implicit way. It validates that even without forward skinning, if you have enough data of one person, a backward method can learn pose generalization by focusing on the deltas (since the main change between poses is captured by \(\delta(\mathbf{p},\theta)\), which can be learned as a smooth function).
SNARF: Skinned Neural Articulated Implicit Shapes (ICCV 2021)
We have already covered SNARF’s methodology in depth. To recap salient points:
SNARF proposes a pose-independent canonical skinning field and finds correspondences via root finding. It demonstrated that such forward-skinning implicit models can outperform backward ones in pose generalization.
SNARF’s training was done on synthetic data: they used a dataset of clothed humans (likely generated via CLOTH3D or by dressing SMPL with different outfits and posing them). They had meshes for each training sample (with consistent vertex ordering per sequence since they come from a simulation, presumably). Using these, they could sample points and know occupancy.
They do not require knowing which canonical points correspond – SNARF finds them during training by solving \(W_\theta(\mathbf{p})=\mathbf{p}'\) as described. The analytical gradients mean they can backpropagate without having to differentiate through the iterative steps explicitly (they derive a formula via implicit differentiation).
Results: SNARF presented qualitative results of clothed humans (loose shirts, skirts, etc.) doing various poses. It maintained coherent geometry even in challenging poses like one arm touching the opposite side, where backward methods might create broken surfaces.
Limitations: SNARF in its original form still assumes one model per subject (like training on one person’s scans or one outfit’s variations). It doesn’t natively handle multiple garment types in one network (though one could extend it by adding a latent code for clothing type as in SMPLicit). Also, SNARF does not model texture – it’s geometry only.
Follow-up - FastSNARF: This improved algorithmic efficiency by formulating the root finding differently, perhaps using better initialization or analytical solutions in linearized cases. It made it feasible to train SNARF-like models much faster.
POP: The Power of Points (ICCV 2021) – Point-Based Modeling
Shifting from pure implicit, POP by Ma et al. (ICCV 2021) demonstrated a powerful alternative for animatable avatars.
POP represents a clothed human as an articulated point cloud with local features, rather than a mesh or continuous field. This point cloud is dense (they use ~10k points) and is associated with a particular subject+outfit. However, unlike a raw scan’s point cloud, POP’s points have additional learned parameters that enable generalization:
Each point has a learned feature vector encoding local geometry details. This feature could be thought of analogous to a latent code per point for the shape of the nearby surface patch.
Each point also has a fixed position in a canonical pose and is attached to the body (they likely use the nearest SMPL vertex or bone weighting for each point to move it with the skeleton rigidly).
A neural network (likely an MLP) takes as input the pose (and possibly global shape info) and outputs a deformation for each point or an update to its feature. The paper mentions a “novel local clothing geometric feature” used to represent shape, and that the network learns to model pose-dependent clothing deformations.
The training of POP is done on many subjects and outfits in many poses. They want a single model that works for multiple outfits. During training, they have numerous 3D scans of people in different clothing and poses. They must somehow align these to a common representation:
They likely perform a nonrigid registration of a point cloud template to each scan. Or since they have SMPL for each scan (the data includes SMPL fits under clothing from CAPE dataset), they might seed points on SMPL surface and offset them to scan surface to initialize correspondences.
They learn a shared point set that can fit all outfits via different point features. Possibly they anchor points to SMPL body coordinates (like UV map positions on SMPL surface) so that each point corresponds to a particular body location. The variation in clothing is then handled by points moving off the body and feature differences.
At inference, POP can fit a new scan (unseen outfit) by optimizing the point features to reconstruct that scan. They call this optimizing the “geometry feature”. This is analogous to finding the latent code in SMPLicit that fits a new scan. After fitting, the new outfit is represented by a point cloud with learned features, which POP’s network can then animate to novel poses.
Compatibility and speed: POP results show it outperforms state-of-art in multi-outfit modeling and unseen outfit animation. It also naturally produces surfaces (the points can be rendered as surfels, or triangulated using e.g. proximity). Since the representation is explicit points, it’s easier to integrate with graphics engines (as noted: implicit methods lack compatibility with standard tools, whereas point clouds can be handled by particle systems or meshing algorithms).
One can think of POP as doing for points what SMPL did for vertices: SMPL had a fixed mesh and learned deformations (blendshapes) for new shapes/poses. POP has a fixed set of points and learns how their positions shift for new outfits and poses.
Comparison to implicit: POP’s advantages include no need for heavy root-finding or neural query per-frame – animating is just linear blend plus a small MLP per point. This makes it fast. It’s also easier to enforce no penetrations (they could, for example, ensure points representing cloth stay outside the body by construction).
A disadvantage is that resolution is limited by number of points; extremely fine details or topology changes beyond what points can sample might be missed (though 10k points is quite dense). Also, if clothing has very thin structures (like belts, straps), implicit fields might represent those better as a continuous surface, whereas a sparse point cloud might need many points to capture a thin strip.
Other Noteworthy Methods
To be comprehensive, let’s briefly mention a few other relevant methods and how they fit:
PIFu / PIFuHD (ICCV 2019 & CVPR 2020): These are image-to-implicit models that infer a person’s shape from a single image. They use pixel-aligned features to query an occupancy function. While not directly avatar models (they target 3D reconstruction), PIFu’s occupancy network can be considered a static implicit model for clothed bodies (with no articulation component).
ARCH, ARCH++ (CVPR 2020, ICCV 2021): These models learned an implicit occupancy with explicit correspondences to a template. They predict an implicit surface and a UV mapping to a template, thereby combining a parametric model with implicit details.
SCARF (SIGGRAPH Asia 2022): Already discussed, it is a hybrid radiance field model where the body uses SMPL (explicit) and the clothing is an implicit volumetric representation. It is optimized from video, showcasing the application of NeRF to clothed avatars with a layered representation (body and clothes separated).
UNIF (ECCV 2022): This improves NASA’s part-based approach by removing the need for part labels, learning to partition the space based on motion consistency. UNIF still defines one implicit SDF per bone, but it learns to assign points to bones via a bone-centered initialization + losses rather than ground-truth segmentation.
Blueprint Algorithms for Key Methods
To further clarify how these methods work in practice, we provide high-level pseudocode for the training or usage of three representative methods: SNARF, POP, and SMPLicit. These are not exact code, but outline the main steps and network operations in each case.
SNARF (Training Procedure): Differentiable Forward Skinning for Implicit Surfaces
SNARF learns a canonical implicit shape \(f(\mathbf{p})\) and a skinning weight field \(w_b(\mathbf{p})\) for \(B\) bones. Training data are posed meshes (or occupancy data) with known skeleton poses.
# Initialize neural networks:
#   F_theta(p): implicit occupancy/SDF in canonical space (MLP)
#   W_phi(p):   skinning weight field (MLP that outputs w_1..w_B for a canonical point p)

for each training iteration:
    # Sample a training mesh with pose θ (bone transforms G_b for b = 1..B)
    # Sample N points {x_i'} in the posed space of that mesh (some near the surface, some inside/outside)
    for each point x_i':
        # Find canonical correspondences via forward skinning
        # (the skinning weights are evaluated at the canonical candidate p, not at x_i')
        define g_i(p) = sum_b W_phi(p)_b * (G_b(θ) · [p; 1])_{1:3} - x_i'
        # g_i is the gap between the forward-skinned canonical point p and the target posed point x_i'
        use root-finding (iterative nonlinear solve, one initialization per bone) to find all p_i with g_i(p_i) ≈ 0
        for each solution p_i (there can be multiple or none):
            evaluate occupancy o = F_theta(p_i)    # implicit surface in canonical space
        o_i = max over solutions                   # aggregated occupancy for x_i'
        y_i = ground-truth occupancy label for x_i' (1 = inside the training mesh, else 0)
    loss = (1/N) * sum_i BCE(o_i, y_i)             # binary cross-entropy on occupancy
    # Backpropagate: gradients flow through F_theta and W_phi,
    # and through the root-finding step via implicit differentiation
    update F_theta and W_phi (e.g., with Adam)
After training, to use SNARF for animation one takes the learned \(F_\theta\) and \(W_\phi\). Given a new pose \(\theta\) and the canonical surface \(F_\theta(\mathbf{p})=0\), one can forward-skin each canonical point \(\mathbf{p}\) via the \(W_\phi\) weights and \(G_b(\theta)\) to obtain the deformed surface. For query-based rendering, for any posed-space point \(\mathbf{x}'\), one finds the canonical \(\mathbf{p}\) whose forward warp matches \(\mathbf{x}'\) via root-finding and evaluates \(F_\theta(\mathbf{p})\). This implicit surface can be rendered or meshed.
POP (Training & Fitting Pipeline): Point-Based Model for Pose-Dependent Clothing
POP learns a single model that can represent many outfits. It has a fixed set of \(N\) points (with positions attached to a reference body) and learns two networks: one to predict how points move with pose, and one to encode outfit geometry into point features.
# Initialize:
#   {P_i}, i = 1..N      # template point positions on a canonical body (e.g., sampled on the SMPL surface)
#   {v_i}                # feature vectors for each point (learned, initially random)
#   M_psi(v_i, θ)        # network predicting an offset ΔP_i for point i given its feature v_i and pose θ
#   (Optional) E_phi     # encoder mapping the geometry of a new scan to point features

# Training:
for each iteration:
    # Sample a training subject j with outfit j, and a pose θ_k from that subject's dataset
    # Retrieve ground-truth surface points X (or mesh) for subject j in pose θ_k
    # (Assume correspondence of X to the template points P via nearest neighbor or a known mapping)
    for each point i:
        P_i' = W_SMPL(θ_k, P_i)            # apply the skeleton's rigid transform to the base point
        ΔP_i = M_psi(v_i_j, θ_k)           # neural, pose-dependent offset
        X_hat_i = P_i' + ΔP_i              # predicted surface point
    loss = sum_i ||X_hat_i - X_i||^2       # point-to-point distance between predicted and actual surface
    backpropagate to update M_psi and the point features v_i_j for that outfit

# (Optional) Fitting a new outfit:
# Given scans of a new subject m (outfit m) in multiple poses:
initialize point features {v_i_m} (e.g., from body shape or an average)
for each scan of subject m:
    optimize {v_i_m} to minimize sum_i ||P_i' + M_psi(v_i_m, θ_scan) - X_i_scan||^2
# (This fits the point-cloud model to all poses of the new outfit)
After training, POP’s network \(M_\psi\) is fixed. To animate a new outfit, one obtains its point features (via the fitting process or an encoder). Then for any new pose \(\theta\), each point’s motion is given by \(P_i' = W_{\text{SMPL}}(\theta, P_i)\) plus the learned corrective \(\Delta P_i = M_\psi(v_i, \theta)\). The result is a new point cloud for the outfit in that pose, which can be rendered (e.g., splatting or surfels) or triangulated for a mesh.
SMPLicit (Inference/Fitting Workflow): Generative Implicit Garment Model conditioned on SMPL
SMPLicit is trained with an image encoder and auto-decoder for latent, and supervising UDF values. Here we outline how one would fit SMPLicit to a new observation (e.g., a depth scan or image of a person in clothing) and then generate/animate the output:
# Inputs: SMPL body shape β and pose θ (known or estimated), and a target observation
#         (e.g., a 3D point cloud of the clothed person)
initialize latent code z (garment style latent) – from the encoder if using an image, otherwise zero

for iter in 1..T:
    for each query point q in space (sampled around the body or on the observed depth):
        compute body-relative coordinates for q
        # e.g., find the nearest SMPL surface point and encode q as (u, v, dist) w.r.t. the body
        d = C_Theta(q, β, θ, z)                   # predicted unsigned distance (network inference)
        compute loss against the target:
            if q lies on the observed surface:    enforce d ≈ 0
            if q lies in observed free space:     enforce d > 0 (e.g., ≥ a margin)
            if q lies inside the observed volume: enforce d ≈ 0 as well (for inside points, since the field is unsigned)
    backpropagate the loss; update the latent code z (and body shape β if optimizing shape)
# optimize until convergence

# Output:
#   Latent code z* that best explains the observation, and possibly a refined body shape β*
# Generate: evaluate C_Theta on a dense 3D grid around the body with z* to get the unsigned distance field
# Run Marching Cubes at a small iso-value (a UDF has no sign change at exactly zero) to extract the garment mesh
# Attach the garment mesh to SMPL: for each garment vertex, find the nearest SMPL vertex and copy its skinning weights
Now the fitted SMPLicit model can be animated: apply a new pose \(\theta_{\text{new}}\) to SMPL body, then skin the garment mesh with the copied weights. Because SMPLicit’s garments are generated in canonical pose and then just follow the body, extreme poses might cause visual artifacts (since the model itself didn’t encode pose-dependent cloth deformation). But as shown by Corona et al., it was sufficient for many scenarios.
This pipeline leverages the differentiability of the implicit model to do inverse fitting. Additionally, one could sample the latent space \(z\) to generate new random outfits on the same body (for example, varying tightness or length as the latent has semantic directions).
Historical Perspective and Future Outlook
In a little over five years, the field has progressed from parametric mesh models of minimally-clothed bodies to highly expressive neural models of clothed humans. Early works like SMPL (2015) and its extensions (e.g., SMPL+D for clothing) set the stage by establishing a common body parameterization. Around 2018-2019, deep learning on 3D surfaces took off: Occupancy Networks and DeepSDF showed that implicit fields can represent shapes with unprecedented detail and continuity.
The human modeling community quickly adopted these for clothed people: PIFu (2019) applied occupancy networks to single-view reconstruction of clothed humans; NASA (2020) introduced part-wise implicit models for articulated objects; CAPE (2020) demonstrated learning clothing deformations as offsets on SMPL with a large dataset.
By 2020-2021, we see a proliferation of neural implicit avatar models:
- SCANimate (CVPR 2021) and Neural-GIF (ICCV 2021) showed how to learn from raw scans, bringing these methods closer to real data application.
- SNARF (ICCV 2021) solved a key technical hurdle for inverse skinning and set a new standard for pose generalization.
- SMPLicit (CVPR 2021) bridged parametric and implicit models, enabling generative topology-varying clothes.
- POP (ICCV 2021) re-imagined explicit point representations and achieved similar goals with arguably simpler inference.
In 2022, research built on these foundations: imGHUM (2021) provided a full-body implicit model that has since been adopted as a prior in image-fitting pipelines, UNIF (ECCV 2022) improved part-based approaches, and SCARF (2022) and related NeRF-human works integrated photorealistic rendering with geometric models (implicitly starting to unify shape and appearance).
We also see the rise of diffusion models and latent generative models for humans (e.g., DreamHuman, 2023, uses imGHUM as a differentiable body prior in a text-to-3D diffusion pipeline to generate avatars). These indicate a future where one can generate a realistic clothed avatar from high-level descriptors and animate it – essentially achieving the decades-old vision of fully data-driven virtual human creation.
Challenges and Future Directions
Despite progress, challenges remain:
Data Efficiency: Capturing real clothed humans with sufficient quality for training is hard; methods are moving toward using commodity data (monocular videos, multi-view but sparse rigs) thanks to advances in differentiable rendering and neural fields.
Dynamic Clothing: Secondary motion like cloth simulation is not fully solved by these quasi-static models; combining physical simulation or learned physics priors with neural avatars is an open area.
Interactivity and Editing: Models like SMPLicit allow latent editing of garments, but a user-friendly and interpretable control (e.g., “shorten the sleeves”) is still being developed, possibly via semantic latent spaces or cross-modal (text2avatar) models.
Performance: Real-time applications would benefit from lighter representations – techniques like POP’s point-based or optimized octree implicit models are promising to deploy avatars in real-time VR/AR.
Another trend is standardizing evaluation – as many methods are quite complex, the community is starting to establish benchmarks for clothed avatar reconstruction and animation (datasets like CAPE, CLOTH3D, and RenderPeople are widely used, and metrics like IoU and Chamfer distance are used to compare methods). We anticipate a convergence where the best ideas from each line – the topological flexibility of implicit surfaces, the speed of point-based methods, the body priors of parametric models, and the realism of radiance fields – will combine. Hybrid models (like SCARF) already hint at this convergence.
Conclusion
Neural implicit and point-based representations have revolutionized clothing and human body modeling. They overcome many limitations of meshes, enabling high-fidelity, differentiable avatars that can be learned from data. The comparative analysis suggests no one representation is strictly superior – each has trade-offs.
Mesh-based models are still useful for certain applications (especially where integration with existing pipelines and real-time are needed), but implicit models unlock unlimited detail and learning-friendly frameworks, while point-based models offer a compelling middle ground. Together, these developments bring us closer to lifelike virtual humans that can be automatically created and animated – a technology with wide-ranging applications in graphics, vision, VR/AR, and the metaverse.