Lecture 06.2 - Learning-Based Fitting of the SMPL Model to Images

Lecture Slides: Learning-Based Fitting of the SMPL Model to Images

Introduction

In the previous lecture, we explored optimization-based methods for fitting the SMPL body model to images (e.g., the SMPLify algorithm). While these approaches are effective without requiring large training datasets, they come with certain limitations: they can be slow (taking several seconds per image), are sensitive to initialization, and may get stuck in local minima.

This lecture shifts our focus to learning-based methods for SMPL model fitting. These approaches leverage deep neural networks to regress SMPL parameters directly from images, often achieving:

  1. Real-time performance (milliseconds vs. seconds per image)

  2. Greater robustness to initialization and image variations

  3. Improved ability to handle ambiguous or partial observations

We will explore pure regression approaches, hybrid methods that integrate optimization in-the-loop, and advanced techniques for video sequences. We’ll derive the mathematical foundations and highlight key architectures, from simple encoders to adversarial and temporal models.

Foundations of Learning-Based SMPL Estimation

Integrating SMPL into Neural Networks

A core idea in learning-based methods is to make the parametric SMPL model part of a neural network’s forward pass. Recall that SMPL is a differentiable function \(M(\boldsymbol{\theta}, \boldsymbol{\beta})\) that outputs a posed mesh (vertices \(V \in \mathbb{R}^{3\times N}\)) given pose parameters \(\boldsymbol{\theta}\) (joint rotations) and shape parameters \(\boldsymbol{\beta}\) (body shape PCA coefficients):

\[V = M(\boldsymbol{\theta}, \boldsymbol{\beta})\]

where \(\boldsymbol{\beta} \in \mathbb{R}^{10}\) and \(\boldsymbol{\theta} \in \mathbb{R}^{3K}\) (typically with \(K=23\) body joints, parameterized in axis-angle or rotation matrices).

The model’s differentiability means we can backpropagate through it: small changes in \(\boldsymbol{\theta}, \boldsymbol{\beta}\) yield smooth changes in vertices and hence in any differentiable error measured on the output. Learning-based approaches implement SMPL layers in frameworks like TensorFlow or PyTorch so that predicted pose/shape parameters can be refined via gradient descent on task losses.
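
As a minimal sketch of this idea, the snippet below runs a SMPL layer inside a differentiable pipeline. It assumes the publicly available `smplx` Python package and a local path to downloaded model files (both are assumptions, not part of the lecture material):

```python
import torch
import smplx  # assumed dependency: pip install smplx, plus downloaded SMPL model files

# Create a differentiable SMPL layer (the model path is a placeholder).
smpl = smplx.create("models/smpl", model_type="smpl")

betas = torch.zeros(1, 10, requires_grad=True)         # shape coefficients beta
body_pose = torch.zeros(1, 69, requires_grad=True)     # 23 joints x 3, axis-angle
global_orient = torch.zeros(1, 3, requires_grad=True)  # root orientation

out = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices, joints = out.vertices, out.joints            # (1, 6890, 3), (1, J, 3)

# Any differentiable loss on the mesh backpropagates to pose and shape.
loss = vertices.pow(2).mean()
loss.backward()
print(betas.grad.shape, body_pose.grad.shape)
```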

Projection Functions

To train on 2D images, we need to compare the 3D model output with 2D annotations (e.g., keypoints, silhouettes). This requires a projection function \(\Pi\) to map 3D points to the image plane. Two projection models are commonly used:

Full Perspective Projection

Given camera intrinsics (focal length \(f\), principal point \((c_x, c_y)\)) and extrinsics (rotation \(R \in SO(3)\), translation \(t = (t_x, t_y, t_z)\)), a 3D point \(X = (X_x, X_y, X_z)\) in the world projects to pixel coordinates \(x = (u, v)\) via:

\[\begin{split}X' &= R X + t \\ u &= f \frac{X'_x}{X'_z} + c_x \\ v &= f \frac{X'_y}{X'_z} + c_y\end{split}\]

where \((X'_x, X'_y, X'_z)\) are the coordinates of \(X\) in the camera frame. This perspective model accounts for foreshortening (apparent size changes with depth).
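
The following helper is a minimal PyTorch sketch of this perspective model (assuming all points lie in front of the camera, i.e., \(X'_z > 0\)):

```python
import torch

def perspective_project(X, R, t, f, c):
    """X: (N, 3) world points, R: (3, 3), t: (3,), f: focal length, c: (2,) principal point."""
    X_cam = X @ R.T + t                         # X' = R X + t (camera frame)
    u = f * X_cam[:, 0] / X_cam[:, 2] + c[0]    # u = f X'_x / X'_z + c_x
    v = f * X_cam[:, 1] / X_cam[:, 2] + c[1]    # v = f X'_y / X'_z + c_y
    return torch.stack([u, v], dim=-1)
```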

Weak-Perspective (Scaled Orthographic) Projection

Many methods assume the person occupies a small field of view, allowing a simplified model:

\[x = s \cdot (R X)_{xy} + t_{xy}\]

Here \(s\) is an overall scale (related to focal length and average depth), \((R X)_{xy}\) are the first two components of \(R X\), and \(t_{xy} = (t_x, t_y)\) is an in-plane translation. Effectively, all points are scaled by the same factor \(s\) and translated, ignoring \(z\)-depth variation.

For instance, in HMR (Human Mesh Recovery), the projection of the 3D joint positions \(X_i(\boldsymbol{\theta}, \boldsymbol{\beta})\) is modeled as:

\[\hat{x}_i = s\,\Pi_{\text{ortho}}(R X_i) + t\]

where \(\Pi_{\text{ortho}}(X) = (X_x, X_y)\) simply drops \(z\). Weak perspective is convenient since \(s\) and \(t\) can be learned as part of the regression. However, it introduces ambiguity between \(s\) and \(t_z\) (depth), which some methods later address by incorporating full-frame information.
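
A corresponding sketch of the weak-perspective model (HMR-style, with \(s\) and \(t_{xy}\) predicted by the network) could look like this:

```python
import torch

def weak_perspective_project(X, R, s, t_xy):
    """X: (N, 3) points, R: (3, 3) rotation, s: scalar scale, t_xy: (2,) image translation."""
    X_rot = X @ R.T                   # rotate into the camera frame
    return s * X_rot[:, :2] + t_xy    # drop z, scale, translate
```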

Loss Functions

2D Reprojection Loss

During training, a 2D reprojection loss is typically used: if \(x_i\) is the ground-truth 2D position of joint \(i\) (e.g., from manual annotation or a 2D keypoint detector) and \(\hat{x}_i = \Pi(X_i(\boldsymbol{\theta}, \boldsymbol{\beta}))\) is the projected model joint, one minimizes:

\[L_{2D} = \sum_i w_i \|\hat{x}_i - x_i\|^2\]

summing over visible joints with weights \(w_i\) (to handle confidence or importance). This encourages the network to output pose parameters that explain the observed 2D pose.
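
A minimal PyTorch version of this weighted reprojection loss, with confidences acting as per-joint weights (and zeros masking invisible joints), might look like:

```python
import torch

def reprojection_loss(pred_2d, gt_2d, weights):
    """pred_2d, gt_2d: (B, J, 2) joint locations; weights: (B, J) confidences."""
    per_joint = ((pred_2d - gt_2d) ** 2).sum(dim=-1)   # squared 2D error per joint
    return (weights * per_joint).sum() / weights.sum().clamp(min=1e-6)
```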

3D Losses

Some works also include 3D losses when ground-truth 3D joint positions \(\tilde{X}_i\) or SMPL parameters are available for training data:

\[L_{3D} = \sum_i \|\hat{X}_i - \tilde{X}_i\|^2\]

or a direct parameter regression loss:

\[L_{param} = \|\hat{\boldsymbol{\theta}} - \tilde{\boldsymbol{\theta}}\|_1 + \|\hat{\boldsymbol{\beta}} - \tilde{\boldsymbol{\beta}}\|_1\]

Many modern approaches train in a mixed supervised manner: some images have 3D labels (allowing \(L_{3D}\)) and a vast number have only 2D labels (using \(L_{2D}\)). The total objective is a weighted sum, with losses zeroed out if not applicable to a given sample.
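
A sketch of such a mixed-supervision objective is shown below; the per-sample flags `has_3d` and `has_smpl` are hypothetical field names indicating which labels exist, and `reprojection_loss` is the helper sketched earlier:

```python
import torch

def mixed_loss(pred, batch, w_3d=1.0, w_param=1.0):
    """pred/batch: dicts of tensors; losses are zeroed for samples lacking labels."""
    loss = reprojection_loss(pred["joints_2d"], batch["joints_2d"], batch["conf_2d"])

    has_3d = batch["has_3d"].float()                                        # (B,)
    err_3d = ((pred["joints_3d"] - batch["joints_3d"]) ** 2).sum(-1).mean(-1)
    loss = loss + w_3d * (has_3d * err_3d).sum() / has_3d.sum().clamp(min=1.0)

    has_smpl = batch["has_smpl"].float()
    err_param = (pred["theta"] - batch["theta"]).abs().mean(-1) \
              + (pred["beta"] - batch["beta"]).abs().mean(-1)
    loss = loss + w_param * (has_smpl * err_param).sum() / has_smpl.sum().clamp(min=1.0)
    return loss
```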

Silhouette and Segmentation Losses

Beyond keypoints, another 2D cue is the person’s silhouette or part segmentation. If one renders the SMPL mesh (using the predicted parameters and some camera guess) into a binary mask or a part index image, it can be compared against a ground-truth silhouette/segmentation.

A common formulation is a silhouette overlap (IoU) loss or binary cross-entropy per pixel. These losses enforce that the projected 3D mesh aligns with the person’s outline in the image. For instance, Pavlakos et al. (2018) incorporated a silhouette consistency term in addition to keypoints.

Silhouette losses are less commonly used than keypoints (since obtaining accurate silhouettes for training is harder), but they provide information about body shape and can correct pose ambiguities not evident from sparse joints.
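
Assuming a soft silhouette \(\hat{S} \in [0,1]^{H \times W}\) has already been produced by some differentiable renderer (not shown here), a combined per-pixel cross-entropy plus soft-IoU loss can be sketched as:

```python
import torch
import torch.nn.functional as F

def silhouette_loss(pred_mask, gt_mask, eps=1e-6):
    """pred_mask: (B, H, W) soft rendered silhouette; gt_mask: (B, H, W) binary GT."""
    bce = F.binary_cross_entropy(pred_mask.clamp(eps, 1 - eps), gt_mask)
    inter = (pred_mask * gt_mask).sum(dim=(1, 2))
    union = (pred_mask + gt_mask - pred_mask * gt_mask).sum(dim=(1, 2))
    soft_iou = (inter / (union + eps)).mean()
    return bce + (1.0 - soft_iou)    # penalize both per-pixel error and poor overlap
```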

Statistical Priors and Adversarial Losses

A recurring challenge is that predicting 3D pose from 2D is an ill-posed problem – many 3D configurations project to the same 2D points. Without regularization, a network trained purely on \(L_{2D}\) may output implausible bodies that nevertheless yield low reprojection error (e.g., bent limbs folded behind the camera, or unnatural poses).

To address this, learning-based methods incorporate priors on pose and shape:

Explicit Priors

One approach is to use explicit priors: for example, penalize improbable shapes via a Gaussian prior on \(\boldsymbol{\beta}\) (e.g., \(L_{shape} = \|\boldsymbol{\beta}\|^2\) if the shape PCA is zero-mean, ensuring shapes stay near average) or on joint angles (as in SMPLify’s pose prior).

Adversarial Priors

Another very successful approach is to learn a latent pose prior via adversarial training. Adversarial approaches introduce a discriminator \(D\) that tries to distinguish “real” human model parameters from “fake” ones predicted by the network.

Kanazawa et al. (2018) pioneered this in HMR, training \(D(\boldsymbol{\theta}, \boldsymbol{\beta})\) to output 1 for samples coming from a large motion capture dataset and 0 for the network’s outputs. This discriminator acts as a data-driven prior: the regression network (the generator) gets an additional loss \(L_{adv}\) if its predicted pose parameters lie outside the distribution of plausible humans.

The adversarial loss can be formulated (using least-squares GAN formulation) as:

\[L_{adv}(E) = \sum_i \mathbb{E}_{\Theta \sim E(I)}[(D_i(\Theta) - 1)^2]\]

where \(E(I)\) are the predicted parameters from image \(I\), and \(D_i\) ranges over multiple discriminators if the adversary is factorized. The discriminator(s) are simultaneously trained to minimize:

\[\sum_i \left( \mathbb{E}_{\Theta \sim p_{real}}[(D_i(\Theta) - 1)^2] + \mathbb{E}_{\Theta \sim E(I)}[D_i(\Theta)^2] \right)\]

Intuitively, the regressor is pushed to generate outputs that fool the discriminator, i.e., that look like real human poses in the SMPL parameter space. This strategy implicitly learns joint-angle limits, anthropometric feasibility (limb lengths, etc.), and typical pose combinations from data.

Kanazawa et al. further factorized the adversarial prior by breaking the SMPL parameter vector into parts: one discriminator for shape (\(\boldsymbol{\beta}\)), one for each joint’s rotation, and one for the whole pose together. Because each input to \(D\) is low-dimensional (e.g., a 9D rotation representation per joint), these discriminators are small and easier to train than one large discriminator on the full pose vector.

The joint-wise \(D\) effectively learns the acceptable range for each joint (learning to detect out-of-range angles), while the full-pose \(D\) captures inter-joint correlations (e.g., you can’t raise your left foot without bending your left knee). This factorized adversarial loss proved crucial to regularize the network when no ground-truth 3D labels are available.
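
The least-squares objectives above can be sketched as follows, where the list of small discriminators stands in for the factorized per-joint, shape, and full-pose critics (a simplified sketch, not the exact HMR implementation):

```python
import torch

def generator_adv_loss(discriminators, theta_fake):
    """Push every D_i(fake) toward 1 so predictions look like real mocap parameters."""
    return sum(((D(theta_fake) - 1.0) ** 2).mean() for D in discriminators)

def discriminator_adv_loss(discriminators, theta_real, theta_fake):
    """Each D_i should output 1 on mocap samples and 0 on the regressor's outputs."""
    loss = 0.0
    for D in discriminators:
        loss = loss + ((D(theta_real) - 1.0) ** 2).mean() \
                    + (D(theta_fake.detach()) ** 2).mean()
    return loss
```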

In summary, learning-based SMPL fitting frameworks build a differentiable pipeline: Image \(\to\) CNN features \(\to\) SMPL parameters \(\to\) Projected 2D outputs, trained end-to-end. They combine reprojection losses for alignment, 3D losses when possible for accuracy, and priors (explicit or learned adversarial) to constrain the solution space.

Early Regression Approaches: HMR and NBF

Human Mesh Recovery (HMR)

One of the first end-to-end learning methods for human mesh recovery was HMR (Human Mesh Recovery) by Kanazawa et al. (CVPR 2018). HMR demonstrated that a deep network could directly regress the 85-dimensional parameter vector (pose \(\boldsymbol{\theta} \in \mathbb{R}^{72}\), i.e., 23 body joints plus the global orientation in axis-angle; shape \(\boldsymbol{\beta} \in \mathbb{R}^{10}\); and a weak-perspective camera \((s, t_x, t_y)\)) from a single RGB image.

This represented a paradigm shift: instead of per-image optimization at inference time, a neural network “learns” to do the fitting.

Architecture

At its core, HMR uses a convolutional encoder (based on ResNet) feeding into a pose parameter regressor that is implemented as an Iterative Error Feedback (IEF) loop. IEF is a technique where an initial guess of parameters is progressively refined by the network in a fixed number of iterations (implemented by a recurrent or unrolled loop).

The network predicts a series of parameter updates \(\Delta \Theta\) to add to the current estimate, gradually improving the fit. This helps the network handle large output dimensions by breaking down the regression task into smaller corrective steps.

In practice, HMR’s regressor outputs \(\Theta = \{\boldsymbol{\theta}, \boldsymbol{\beta}, \mathbf{c}\}\) (where \(\mathbf{c}\) are weak-perspective camera params \((s, t_x, t_y)\)) and does so in 3 iterative refinements.

The final output’s 3D joints \(\hat{X}(\Theta)\) are projected with the weak-perspective model and compared to ground-truth 2D keypoints for a reprojection loss. If any 3D pose labels are available (e.g., from motion capture datasets), an additional 3D loss on joint positions or parameters can be applied.
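
A simplified sketch of such an IEF regressor is given below; the actual HMR head initializes from the mean SMPL parameters and uses a deeper MLP with dropout, so this is only illustrative:

```python
import torch
import torch.nn as nn

class IEFRegressor(nn.Module):
    """Predict additive parameter updates from [image feature, current estimate]."""
    def __init__(self, feat_dim=2048, param_dim=85, n_iter=3):
        super().__init__()
        self.n_iter = n_iter
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + param_dim, 1024), nn.ReLU(),
            nn.Linear(1024, param_dim),
        )
        # Initial estimate (zeros here; HMR uses the mean pose/shape/camera).
        self.register_buffer("theta_init", torch.zeros(1, param_dim))

    def forward(self, feat):                       # feat: (B, feat_dim)
        theta = self.theta_init.expand(feat.shape[0], -1)
        for _ in range(self.n_iter):
            delta = self.mlp(torch.cat([feat, theta], dim=1))
            theta = theta + delta                  # corrective update
        return theta                               # (B, 85): pose, shape, camera
```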

Adversarial Pose Prior

The key innovation of HMR was introducing an adversarial loss to cope with the lack of direct 3D supervision on in-the-wild images. A discriminator \(D\) was trained on a large corpus of SMPL parameters obtained from mocap (the CMU MoCap dataset processed via MoSh, yielding a distribution of plausible human poses and shapes).

The discriminator learns to output “real” for any parameter vector coming from this dataset and “fake” for the network’s output. The HMR regressor then receives an adversarial loss term that pushes its outputs to be indistinguishable from real poses.

This adversarial prior ensures that even if an image’s 2D keypoints are sparse or ambiguous, the estimated 3D pose will lie on the manifold of realistic human poses. Notably, HMR’s adversarial prior captured a variety of constraints implicitly – from joint angle limits to natural pose transitions – outperforming simpler priors like SMPLify’s Gaussian Mixture Model on joint angles.

The overall HMR loss is:

\[L_{HMR} = L_{2D\_joints} + \lambda_{3D} L_{3D} + \lambda_{adv} L_{adv}\]

with \(\lambda\) weights and \(L_{3D}\) used only for those images (e.g., from Human3.6M) where 3D ground truth is available.

HMR was trained on a mix of datasets (several 2D pose datasets for \(L_{2D}\) and some 3D mocap data for \(L_{adv}\) and optional \(L_{3D}\)) and was the first to demonstrate end-to-end learning of pose and shape with an inference speed of ~30 FPS – a huge advantage over multi-second optimization. Its accuracy on 3D pose benchmarks was on par with or better than optimization methods of the time, and qualitatively it produced reasonable body shapes.

Neural Body Fitting (NBF)

In parallel to HMR, Omran et al. (3DV 2018) proposed Neural Body Fitting (NBF), which took a slightly different approach. NBF is a hybrid CNN + model-based method that explicitly incorporates body part segmentation as an intermediate representation.

The pipeline consists of two stages:

  1. A 2D part segmentation network processes the input image into a segmentation map with labels for torso, limbs, head, etc.

  2. An encoder network takes this color-coded part segmentation and regresses the SMPL parameters \(\boldsymbol{\theta}, \boldsymbol{\beta}\).

The motivation is that part segments provide a richer and less ambiguous input than raw pixels for inferring 3D pose – they roughly capture the 2D shape and orientation of limbs. By first solving an easier 2D vision problem (segmentation), the burden on the 3D regressor is reduced.

Mathematically, if \(S(I)\) is the segmentation probability map produced from image \(I\), NBF learns a function \(f: S(I) \mapsto (\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\beta}})\). The SMPL model is then applied to get a mesh \(M(\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\beta}})\), from which 3D joints \(\hat{X}\) are obtained (via a fixed regressor matrix \(J\) that maps vertices to joints).

These joints are projected to 2D with a known camera model (NBF assumed a given camera or estimated a simple orthographic projection). The entire chain \(I \to S(I) \to f(\cdot) \to \hat{X} \to \hat{x}\) is differentiable, so NBF can be trained end-to-end with a combination of 2D joint losses and 3D losses (when available).

One notable design choice: NBF applied the loss on 3D rotations in matrix form rather than on the raw axis-angle parameters. They found that regressing a rotation in axis-angle space and comparing to ground truth can be problematic (due to angle wrapping and the fact that a small Euclidean error in axis-angle might not correspond to a small rotation difference).

Instead, they convert predicted and GT rotations to rotation matrices and compute an \(L_1\) or \(L_2\) loss on the matrix entries (or equivalently on axis-angle after mapping to nearest rotation). This ensures the loss directly penalizes orientation discrepancy in a smooth way.
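
A sketch of this rotation-matrix loss, using Rodrigues' formula to map axis-angle vectors to matrices, is shown below:

```python
import torch

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: (..., 3) axis-angle vectors -> (..., 3, 3) rotation matrices."""
    angle = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    x, y, z = (aa / angle).unbind(-1)
    zero = torch.zeros_like(x)
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero], dim=-1)
    K = K.reshape(aa.shape[:-1] + (3, 3))           # skew-symmetric cross-product matrix
    eye = torch.eye(3, device=aa.device)
    sin, cos = angle.sin()[..., None], angle.cos()[..., None]
    return eye + sin * K + (1.0 - cos) * (K @ K)

def rotation_loss(pred_aa, gt_aa):
    """Penalize orientation error on matrix entries instead of raw axis-angle values."""
    return (axis_angle_to_matrix(pred_aa) - axis_angle_to_matrix(gt_aa)).abs().mean()
```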

Comparison of HMR and NBF

Both are single-image, fully differentiable models that output SMPL parameters. The key differences are:

  • Intermediate representation: HMR is direct regression from raw pixels, which is elegant and requires minimal preprocessing but relies heavily on the network learning internal representations for body parts and geometry. NBF explicitly gives the network a structured input (the part segmentation), which can make learning easier and also provides a degree of interpretability.

  • Training paradigm: NBF did not use an adversarial prior; instead, it relied on some 3D labels (from a dataset called UP-3D which provides fitted SMPL parameters to real images) and implicit regularization from the segmentation input to keep outputs realistic.

In practice, NBF achieved competitive results to HMR on benchmarks, confirming that 2D proxy tasks (like segmentation or keypoint detection) can usefully guide 3D regression. Later methods would combine both strategies – using rich intermediate representations and still maintaining an end-to-end trainable system.

Evolving Architectures: Hybrid and Improved Regression Methods

After HMR and NBF, a flurry of works in 2019-2021 proposed improvements. Here we highlight major developments: integrating model-fitting into training (SPIN), refining regressors with feedback (PyMAF), leveraging full-image information (CLIFF), and extending to whole-body models (PIXIE).

SPIN: Optimization in the Training Loop

While purely regression-based methods are fast, they can suffer if the training data is limited or if the 2D-to-3D mapping is too ambiguous. On the other hand, classical optimization (like SMPLify) can often find a very precise fit for a given image by directly minimizing reprojection error – but it’s slow at test time and requires a good initialization.

SPIN (SMPL oPtimization IN the loop) by Kolotouros et al. (ICCV 2019) attempts to get the best of both worlds. The idea is to use optimization during training to supervise the network, but at test time the network alone produces the result (no slow optimization needed).

Self-Improving Loop

In each training iteration, for a given image \(I\), the CNN first predicts an initial estimate of SMPL parameters \(\Theta_{reg} = (\boldsymbol{\theta}, \boldsymbol{\beta}, \mathbf{c})\).

Instead of immediately computing a loss on \(\Theta_{reg}\), SPIN uses this as a starting point for SMPLify (the optimization-based fitter) to refine the parameters to better match the 2D keypoints of that image. In other words, it runs a few iterations of analysis-by-synthesis model fitting, initialized from the network’s output.

This yields a new set \(\Theta_{opt}\) which (ideally) has lower reprojection error than \(\Theta_{reg}\). Now, \(\Theta_{opt}\) is treated as pseudo-ground-truth for the network, and a loss \(L = \|\Theta_{reg} - \Theta_{opt}\|^2\) is applied to train the CNN to predict closer to this optimally fitted solution.

In effect, the optimization module “corrects” the network on each training example, and the network is explicitly trained to mimic the optimizer’s result. This process is self-improving:

  • As the network gets better, its initial guesses \(\Theta_{reg}\) are closer to the optimum, making the optimizer’s job easier and more likely to succeed

  • Conversely, the optimizer provides increasingly accurate training targets, which make the network even better

Crucially, this in-the-loop fitting provides a form of 3D supervision even for images that only have 2D annotations. Normally, without 3D ground truth, a purely regression method would have to rely on weak \(L_{2D}\) losses or an adversarial prior (as HMR did) to infer 3D structure.

SPIN bypasses this by using the optimizer to create a plausible 3D solution that explains the 2D points. That solution serves as a training signal. It was found that this “privileged” supervision (the optimizer has effectively done a mini model-fitting for that image) is more informative than a binary real/fake signal from a discriminator.

Indeed, Kolotouros et al. note that in a setting with only 2D keypoints, SPIN’s loop outperformed a HMR-style adversarial training.
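
A schematic training step in this spirit is sketched below; `smplify_refine`, `reprojection_error_per_sample`, and the threshold are hypothetical placeholders for the optimization routine and its bookkeeping, not actual SPIN function names:

```python
import torch

REPROJ_THRESHOLD = 100.0   # hypothetical cutoff for accepting an optimized fit

def spin_training_step(model, smplify_refine, images, keypoints_2d, optimizer):
    theta_reg = model(images)                               # regressor prediction

    with torch.no_grad():                                   # the fitter runs outside autograd
        theta_opt, fit_err = smplify_refine(init=theta_reg, keypoints=keypoints_2d)

    good = (fit_err < REPROJ_THRESHOLD).float()             # reject failed fits
    loss_fit = (good * ((theta_reg - theta_opt) ** 2).sum(-1)).mean()

    # Samples without a usable fit fall back to a plain 2D reprojection loss.
    loss_2d = ((1.0 - good) * reprojection_error_per_sample(theta_reg, keypoints_2d)).mean()

    loss = loss_fit + loss_2d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```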

Robust Training and Performance

Implementing SPIN requires care: sometimes SMPLify can fail or produce implausible fits (especially if the network’s initialization is poor in early training). The authors addressed this by:

  • Rejecting bad fits (if the reprojection error after fitting is above a threshold, they do not use \(\Theta_{opt}\), and instead fall back to a standard 2D loss for that image)

  • Clamping extreme shape values (avoiding out-of-bound \(\boldsymbol{\beta}\))

  • Keeping a dictionary of best fits per training image: if a later epoch’s SMPLify produces a worse result than a previous attempt, they retain the previous best, ensuring the supervision only gets better over time

  • Initializing this dictionary by running SMPLify offline on all training images (to give the network a reasonable starting target)

These tactics stabilized training.
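
The per-image dictionary of best fits can be sketched as simple bookkeeping (illustrative only, keyed by a hypothetical image identifier):

```python
best_fits = {}   # image_id -> (theta_opt, reprojection_error)

def update_best_fit(image_id, theta_opt, reproj_err):
    """Keep a SMPLify result only if it explains the 2D keypoints better than before."""
    stored = best_fits.get(image_id)
    if stored is None or reproj_err < stored[1]:
        best_fits[image_id] = (theta_opt, reproj_err)
    return best_fits[image_id][0]    # supervision target for this image
```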

SPIN was trained on a mix of datasets (similar to HMR: e.g., COCO and MPII for 2D, H3.6M and MPI-INF-3DHP for 3D, and additionally the 2D keypoints from LSP, etc.), and importantly no adversarial prior was used – the network leans on the optimized fits to stay realistic.

The results were impressive: SPIN significantly outperformed HMR and other contemporaries on benchmarks like 3DPW (a challenging dataset of outdoor videos). For instance, on 3DPW SPIN achieved a reconstruction error of ~59.2 mm, whereas HMR was around 81.3 mm. Even a baseline “SPIN without in-loop fitting” (training on static pseudo-labels) got ~66 mm, highlighting that the in-loop update gave a further boost.

Qualitatively, SPIN results were visually closer to the image evidence, since the network had essentially learned to perform a few iterations of keypoint alignment internally. SPIN’s approach underscored the value of combining optimization and learning: optimization can enhance training data, and learning makes optimization robust at runtime.

PyMAF: Pyramidal Mesh Alignment Feedback

Despite the advances of HMR, SPIN, and similar methods, a common issue remained: the precision of the final alignment. A network might predict pose parameters that give correct 2D joint projections, but the limbs or torso of the 3D mesh might still be slightly misaligned relative to the image (especially if evaluated by overlap or detailed correspondences).

Minor errors in joint angles compound along the kinematic chain, leading to visible misplacements of elbows or knees in the image frame. Additionally, regression networks often emphasize global pose (to get keypoints right) at the expense of fine local pose adjustments or shape adjustments that would improve pixel-level alignment.

PyMAF by Zhang et al. (ICCV 2021) addresses this by introducing an iterative refinement loop inside the network that explicitly checks the mesh-image alignment and corrects the parameters. “PyMAF” stands for Pyramidal Mesh Alignment Feedback. The pipeline is as follows:

  1. The image is passed through a backbone to produce a pyramid of feature maps (high-level, low-resolution features and lower-level, higher-resolution features).

  2. An initial prediction of SMPL parameters \((\boldsymbol{\theta}_0, \boldsymbol{\beta}_0)\) is made from the global feature (this is analogous to previous regressors).

  3. The current predicted mesh (at iteration \(t\)) is then used to sample “mesh-aligned” evidence from the feature pyramid. Concretely, they project each vertex (or a selection of key vertices/landmarks) onto one of the higher-resolution feature maps and extract the feature values at those locations. These features tell the network how well the mesh aligns with image details. For example, if the hand vertices project onto a region of the image feature map that has background-like features instead of hand-like features, that indicates a misalignment.

  4. These mesh-aligned features are fed into a refinement module (e.g., an MLP or small CNN) that predicts an update to the parameters: \((\Delta \boldsymbol{\theta}, \Delta \boldsymbol{\beta})\). The network then produces a corrected estimate \((\boldsymbol{\theta}_1, \boldsymbol{\beta}_1) = (\boldsymbol{\theta}_0 + \Delta \boldsymbol{\theta}, \boldsymbol{\beta}_0 + \Delta \boldsymbol{\beta})\).

  5. This process can repeat for a couple of iterations (hence “feedback loop”), analogous to how IEF was used in HMR but here the feedback is guided by spatial feature information at projected mesh locations.

By leveraging higher-resolution feature maps in later refinement steps, PyMAF ensures that local image evidence (like the contour of an arm or the shape of a leg) can adjust the global prediction. This is important because a CNN’s deepest features (used for the initial prediction) are low-resolution and might not preserve precise spatial details.

The use of a feature pyramid (common in detection/segmentation networks) means even small misalignments can be detected at the appropriate scale. The outcome is a tighter mesh-image alignment, which the authors demonstrate qualitatively (the projected mesh outlines align better with the person’s silhouette).
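
The core sampling step can be sketched with `torch.nn.functional.grid_sample`; the actual PyMAF implementation samples a down-sampled set of mesh points and concatenates features across pyramid levels, so this is only a simplified illustration:

```python
import torch
import torch.nn.functional as F

def sample_mesh_aligned_features(feat_map, points_2d, img_size):
    """feat_map: (B, C, H, W) pyramid level; points_2d: (B, V, 2) projected mesh points
    in pixel coordinates of an img_size x img_size image. Returns (B, V, C) features."""
    grid = 2.0 * points_2d / (img_size - 1) - 1.0      # normalize to [-1, 1] for grid_sample
    grid = grid.unsqueeze(2)                           # (B, V, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)   # (B, C, V, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)        # per-point feature vectors
```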

Auxiliary Supervision

To train this system, PyMAF employs an auxiliary pixel-wise supervision signal. Essentially, they guide the feature extractor to be sensitive to mesh correspondence. While the details are technical, one way to implement this is to use a ground-truth correspondence map (if available, e.g., DensePose or segmentation) and ensure that the feature at a pixel encodes the identity of the body part or even the specific vertex it corresponds to.

This encourages the network that, when a predicted mesh is overlaid on the image, the features at mesh locations carry information about the true underlying body. Such dense supervision can be obtained from synthetic data or fitting-based annotations (the UP-3D dataset provided ground-truth part segments which could be used).

The authors note that this auxiliary loss makes the extracted “mesh-aligned evidence” more reliable, since the features are trained to represent body part information. In the absence of it, features might be distracted by textures or clothing, adding noise to the feedback loop.

PyMAF’s results showed improved per-pixel alignment and competitive pose accuracy. On 3DPW and Human3.6M, it improved error metrics slightly over previous SOTA, but more notably, it produced nicer visual alignment.

The approach is a bridge between pure regression and iterative fitting: it doesn’t run an external optimizer, but internally it iteratively samples the image at locations suggested by the current mesh and corrects the mesh – conceptually similar to how an analysis-by-synthesis optimizer would tweak parameters to better match image evidence, but here learned and done in a fixed small number of steps at inference.

CLIFF: Using Full-Frame Context for Camera Orientation

Most learning-based methods follow a top-down approach: they take a cropped image of a person (often cropped tightly around the person using a detector) and estimate the body relative to that crop. This has a limitation: once the image is cropped, the absolute position of the person in the original image and the true camera view are lost.

The network typically predicts the person’s pose in a crop-relative coordinate system, and uses a weak-perspective camera that is also relative to the crop. As a result, determining the global orientation of the person (e.g., facing north vs east in world coordinates) is tricky – rotating the camera or the person by 180° yields the same crop.

Methods like HMR or SPIN thus can only predict relative rotation (they often assume the person’s root joint rotation around vertical axis is zero in the crop, since any azimuth rotation is “absorbed” by the camera). In practice, this yields poor estimates of the global heading of the person and requires post-hoc adjustments if one needs the result in a global frame.

CLIFF (Carrying Location Information in Full Frames) by Li et al. (ECCV 2022) specifically tackles this issue. The key idea is to preserve the location information of the person in the full image both in the input and in the supervision. There are two main modifications in CLIFF:

Input Encoding of Location

Instead of only feeding the cropped RGB patch to the network, CLIFF also feeds in the person’s normalized location within the full image. In practice, one can concatenate extra channels or features that encode the bounding box position (e.g., the normalized center coordinates and scale of the crop relative to image size).

CLIFF obtains “holistic features” by combining the appearance features of the crop with this global location cue. By doing so, the network can learn correlations between a person’s position in the image and the likely perspective distortion or camera rotation.

For example, a person at the left edge of the image might be more likely turned sideways due to camera view, etc. This is akin to giving the network a sense of where the camera is relative to the person.

Full-Frame Reprojection Loss

Instead of computing the 2D keypoint loss in the crop coordinates, CLIFF computes it in the original image coordinates. They take the predicted 3D body (which now includes a global rotation parameter) and project it onto the full image, comparing to the ground-truth full-image keypoints.

Because the network knows it will be penalized for misaligning in the full frame, it must learn to predict the correct global rotation (otherwise, even if the pose looks right in the crop, the joints might fall at wrong full-image positions after undoing the crop transform).

In earlier methods, if a person is facing left vs right, the cropped image might look identical (just mirrored), and both would yield the same 2D loss; but in CLIFF, facing left vs right leads to different projections in the full frame (imagine the left-facing person’s joints are located more to the left in the full image vs the right-facing person’s joints). Thus the ambiguity is reduced.

These two changes allow CLIFF to directly predict the global orientation of the body (pelvis rotation in world coordinates) along with the pose and shape.
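
A simplified sketch of the coordinate change behind the full-frame loss is shown below: 2D joints predicted in crop coordinates are mapped back into the original image using the bounding-box center and size, and only then compared to the full-image keypoints (CLIFF additionally reasons about the full-image camera translation, which is omitted here):

```python
import torch

def crop_to_full_image(joints_crop, bbox_center, bbox_size, crop_res=224):
    """joints_crop: (B, J, 2) pixels in a crop_res x crop_res patch;
    bbox_center: (B, 2) crop center in the full image; bbox_size: (B,) crop side length."""
    scale = (bbox_size / crop_res).view(-1, 1, 1)                    # (B, 1, 1)
    top_left = bbox_center.unsqueeze(1) - bbox_size.view(-1, 1, 1) / 2.0
    return joints_crop * scale + top_left                            # (B, J, 2) full-image pixels

def full_frame_loss(joints_crop, bbox_center, bbox_size, gt_full, conf):
    pred_full = crop_to_full_image(joints_crop, bbox_center, bbox_size)
    err = ((pred_full - gt_full) ** 2).sum(-1)
    return (conf * err).sum() / conf.sum().clamp(min=1e-6)
```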

Training Data and Results

Training CLIFF required having some ground-truth global annotations. The authors leveraged the AGORA dataset (which provides ground-truth SMPL parameters in a global coordinate system) and some pseudo-labels.

They also built a pseudo-GT annotator based on CLIFF itself: after training an initial model, they used it with the full-frame loss to annotate in-the-wild images with global rotations, improving the training data.

In results, CLIFF significantly outperformed previous methods on metrics that depend on global orientation and position. For instance, on the 3DPW dataset CLIFF improved MPJPE by 5-6 mm over prior art and achieved state-of-the-art, and on the challenging AGORA dataset it ranked first on the public leaderboard at the time.

The benefit was most pronounced in global pose accuracy, while maintaining or improving local pose estimation.

In summary, CLIFF demonstrated the importance of using the full-frame context for 3D human pose: by “carrying location information” from the beginning, the network’s camera prediction no longer has to guess the depth or global rotation arbitrarily. This idea can be combined with any regression backbone (CLIFF’s architecture was built on a ResNet and an MLP head similar to SPIN’s) and is now a common consideration in extending single-person mesh recovery to multi-person scenes or images with camera movement.

PIXIE: Whole-Body Regression with Part Experts

All the methods discussed so far focus on the body pose and shape. However, the SMPL model has been extended to SMPL-X, which includes the face and hands (for a total of 104 shape parameters and 54 pose parameters including facial expression and finger joints).

Reconstructing a full human avatar from an image involves not only the body but also facial details and hand poses, which are challenging due to their fine-scale nature and often small pixel size in the image. Traditionally, computer vision has tackled body, face, and hands with separate specialized models:

  • Face shape from a headshot using 3D Morphable Models

  • Hand pose from a cropped image using a hand model

  • Body pose and shape from a full-body image

These specialized methods can capture details (like facial expressions or finger bending) better than a generic body model regressor. But they operate independently and might produce an inconsistent overall human (e.g., the face might not match the body shape, or the pose might be inconsistent at the wrist where body and hand meet).

PIXIE by Feng et al. (3DV 2021), presented in the paper “Collaborative Regression of Expressive Bodies using Moderation,” is a collaborative regression method that combines experts for body, face, and hands to produce a single coherent SMPL-X fit. Its architecture couples part-specific experts with a moderator network.

Architecture and Approach

Here’s how PIXIE works:

  1. It has three expert regression networks: one trained for whole-body (SMPL-X) pose and shape (much like an HMR or SPIN regressor, but for SMPL-X parameters), one specialized for the face (predicting detailed facial shape and expression), and one for the hands (predicting finger poses).

    These experts are trained on their domain-specific data: e.g., the face expert on face datasets with 3D face scans or landmarks, the body expert on body pose datasets, etc.

  2. All experts predict parameters that reside in SMPL-X’s shared space. Notably, SMPL-X uses a common shape vector for the body and face, meaning that if the face expert predicts a certain face shape, that should correspond to a certain body shape as well. PIXIE leverages this by ensuring the shape parameters are shared across the networks.

  3. A moderator network takes the features or intermediate results of all experts and learns to weight and merge them into a final prediction. The moderator essentially decides how much to trust the face expert vs. the body expert for the head pose and shape, how much the hand expert vs. body expert for the wrist/hand, etc., on a per-instance basis.

    For example, if the face is clearly visible (frontal, high-res), the face expert’s prediction of head shape should be given high weight; if the face is occluded or blurry, the body expert’s rough guess (which might be based on demographics or overall body shape) might be more reliable to avoid artifacts.

  4. PIXIE also introduces a “gendered” shape loss. Human body shape is highly correlated with gender, and SMPL-X shape space implicitly represents gender (male vs. female body types).

    They explicitly classify the subject’s gender from the image (or use annotation if available) and encourage the predicted shape \(\boldsymbol{\beta}\) to lie in the subspace for that gender. Concretely, they trained separate shape PCA for male and female from training data, and have PIXIE predict a gender label along with the shape; a loss then penalizes shape parameters that contradict the inferred gender.

    This leads to more accurate body shapes, as prior methods that ignore gender might predict an unrealistic average shape when given ambiguous cues.

Training and Results

Training PIXIE required assembling a variety of datasets: 3D face scans for facial shape, body datasets for pose, hand datasets for finger pose, etc. During inference, PIXIE runs all experts and the moderator to output the final SMPL-X parameters (including face expression, jaw pose, hand pose, body pose, shape).

The result is an animatable full-body avatar with realistic face detail, something previous body-only methods could not provide. An illustration in the paper shows that a conventional SMPL-X regression (like ExPose or SPIN-X) yields a very generic face (since it’s not focusing on face shape), whereas PIXIE produces a face shape that matches the person (even capturing smile lines or jaw structure).

PIXIE achieved state-of-the-art accuracy on benchmarks for full-body capture, outperforming separate approaches that fit body and then refine face/hands independently. It also demonstrated the value of paying attention to demographic attributes (gender in this case) when estimating shape.

In practice, the authors released PIXIE’s code, enabling others to produce high-fidelity whole-body reconstructions from a single image. One limitation of PIXIE is that it still assumes the person is mostly visible (especially the face), and it doesn’t model clothing – it provides a nude body mesh with perhaps some offsets for facial detail.

Temporal Methods: From Single Images to Video Sequences

So far, we’ve considered single-image estimation. Now we’ll discuss extensions to video, where temporal consistency and motion realism become priorities. Estimating SMPL parameters for each frame independently often yields jittery results – the pose might flicker due to detector noise or minor network inconsistencies.

Moreover, without temporal context, an algorithm can’t enforce physically plausible transitions (e.g., limbs shouldn’t teleport from one position to another between adjacent frames). Several approaches have incorporated temporal models or priors to handle video input.

We’ll detail three notable methods: VIBE, TCMR, and MotionBERT, which represent the evolution from recurrent networks with adversarial motion priors to transformers with large-scale pretraining.

VIBE: Adversarial Motion Prior with GRUs

VIBE (Video Inference for human Body pose and shape Estimation) by Kocabas et al. (CVPR 2020) was one of the first frameworks to truly leverage video data to improve human mesh recovery. The authors observed that naive per-frame application of HMR yields shaky, unnatural motion, partly because the model has no knowledge of physics or typical motion patterns. To address this, VIBE introduced a temporal network and a motion discriminator.

Architecture

VIBE uses a two-stage approach:

  1. First, each video frame is passed through a CNN (e.g., ResNet) to extract features. This is similar to HMR’s image encoder (and indeed they initialized it from SPIN’s pretrained model).

  2. Then, these features are fed into a temporal encoder – specifically, a gated recurrent unit (GRU) – which processes the sequence of frame features and outputs SMPL parameters for each frame sequentially.

The GRU has an internal state that carries information from past frames, enabling it to produce smoother and context-aware predictions. Intuitively, if in frame \(t-1\) the person had arms raised and in frame \(t\) the CNN feature is ambiguous, the GRU might continue to predict arms raised because it “remembers” the previous state, rather than jittering to a random new pose.
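
A condensed sketch of this two-stage design is given below; the real VIBE temporal encoder adds residual connections and reuses an HMR/SPIN-style iterative regressor head, so this is only illustrative:

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Per-frame CNN features -> GRU over time -> SMPL parameters per frame."""
    def __init__(self, feat_dim=2048, hidden_dim=1024, param_dim=85):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, param_dim)

    def forward(self, frame_feats):             # (B, T, feat_dim)
        h, _ = self.gru(frame_feats)            # hidden state carries past context
        return self.head(h)                     # (B, T, 85): pose, shape, camera per frame
```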

Adversarial Motion Discriminator

Beyond just smoothing, VIBE wanted to ensure the sequence of poses is realistic as a whole. They leveraged the AMASS dataset – a large collection of motion capture sequences – to train a discriminator \(D_{motion}\) that looks at a window of predicted poses (e.g., 16 frames) and determines if this motion came from a real human or from the model.

This discriminator considers the temporal progression of joints, effectively learning what real motion dynamics look like (smooth acceleration, no jitter, plausible gait cycles, etc.). The regression network (the GRU + image encoder) is then adversarially trained to produce pose sequences that fool the motion discriminator.

In practice, the discriminator architecture in VIBE was also a GRU (or similar recurrent network) that outputs a probability of the sequence being real. The loss from this GAN-like setup complements the per-frame supervised losses: the network still gets 2D keypoint losses on each frame (to ensure accuracy per frame), and if available, some 3D supervision on some dataset, but additionally the adversarial loss pushes it to maintain realistic transitions.

By combining these, VIBE achieved both high per-frame accuracy and much improved temporal smoothness. In quantitative terms, on 3DPW (a dataset with ground-truth pose for videos), VIBE improved the PA-MPJPE and MPJPE by a noticeable margin over SPIN (which was state-of-the-art single-frame).

More importantly, it drastically reduced the acceleration error – a metric of jitter which measures frame-to-frame differences in pose accelerations. The outputs “looked” more like plausible motion; the paper’s Figure 1 illustrates how a previous method’s output had awkward inter-frame inconsistency while VIBE’s output was natural.
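
Acceleration error is usually computed from second-order finite differences of the 3D joint trajectories; a small sketch of such a metric:

```python
import torch

def acceleration_error(pred_joints, gt_joints):
    """pred_joints, gt_joints: (T, J, 3) per-frame 3D joints; returns mean accel difference."""
    acc_pred = pred_joints[2:] - 2 * pred_joints[1:-1] + pred_joints[:-2]
    acc_gt = gt_joints[2:] - 2 * gt_joints[1:-1] + gt_joints[:-2]
    return (acc_pred - acc_gt).norm(dim=-1).mean()
```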

Training with Unpaired Data

One technical aspect: training VIBE required dealing with unpaired data – the mocap sequences in AMASS have no associated images (they are just 3D pose sequences). So the motion discriminator sees two kinds of input:

  1. Poses produced by the regressor from video frames (these are “fake”)

  2. Poses coming from AMASS (these are “real”)

The discriminator tries to distinguish them, while the regressor tries to fool it. This way, VIBE could train on real videos without 3D labels (for the adversarial part) and still benefit from the vast AMASS dataset (which provides examples of how limbs move, without needing to see the video).

This approach is analogous to HMR’s use of an adversarial prior for static poses, but extended to sequences as a temporal prior. As the authors note, this was critical because large-scale 3D motion annotations in the wild are scarce, but AMASS provides a rich repository of motions that can be used to regularize the network.

The final VIBE model was released with code and became a popular off-the-shelf solution for video pose estimation. Its limitations included occasional failure on very fast motions (where a GRU might lag) and the need for a pretrained keypoint detector (VIBE uses detected 2D keypoints during training as input as well, to better focus on the person). Nonetheless, VIBE set a new standard by demonstrating that temporal adversarial learning can significantly improve 3D human pose and shape estimation in videos.

TCMR: Temporally Consistent Mesh Recovery

While VIBE improved smoothness, it still operated mainly causally (forward in time with a GRU) and one could observe a trade-off: aggressive adversarial smoothing can sometimes lag the motion or oversmooth, sacrificing some accuracy for stability. In 2021, Choi et al. proposed TCMR (Temporally Consistent Mesh Recovery), which further addressed temporal consistency by architectural means rather than adversarial training.

Key Idea

TCMR introduces a two-stream temporal network:

  • One stream (called PoseForecast) focuses on predicting the current pose from past and future frames

  • The other (the main temporal encoder) processes the whole sequence, including the current frame

By explicitly forecasting the current pose without using the current frame’s features, the network obtains a “second opinion” on what the pose should be, based purely on motion continuity. This forecast is then integrated with the actual current-frame-based prediction to yield the final output.

In doing so, they ensure that the current frame’s visual features (which might be noisy or temporarily confusing) do not solely dictate the estimation – the temporal prior can override or pull the solution toward a smooth trajectory.

Implementation

Concretely, TCMR uses a bidirectional recurrent network. One part of the network looks at a window of frames excluding the current one (or down-weights the current) to infer a residual pose update that would make the sequence smoother. This is added to or combined with the standard sequence encoding that does include the current frame.

At inference, they effectively remove the dependency on the current frame in one branch, so the final predictor is less dominated by instantaneous glitches. As the authors put it, they “remove the strong dependency on the current static feature” so the model “can focus on past and future frames without being dominated by the current frame”.

TCMR’s training still uses standard losses (2D keypoints, possibly 3D if available) but no adversarial component. By architecture design, it achieves smoothness.

The results showed that TCMR not only improved temporal consistency (as measured by acceleration error or similar metrics) by a large margin, but also slightly increased per-frame accuracy compared to VIBE.

Qualitatively, TCMR outputs are very stable – e.g., if a person’s arm is moving, the motion is fluid and without jitter, even better than VIBE which, under some circumstances, might wobble due to its GRU memory not perfectly capturing long-term context.

Because TCMR can leverage future frames (thanks to a bidirectional or non-causal design), it achieves a more stable pose at frame \(t\) knowing what happens at frame \(t+1, t+2, ...\) (of course, this means TCMR as described is an offline method, not suitable for real-time streaming use cases where future frames aren’t available; but one could run a delayed causal version in practice).

In summary, TCMR represents the refinement of temporal modeling: instead of adding another loss (like VIBE’s \(L_{adv}\)) to coerce the network into smooth behavior, TCMR builds the notion of forecasting and smoothing into the network’s structure. It validates that sequence modeling techniques (like bi-directional GRUs or sequence-to-sequence networks) can be very effective for human mesh recovery, yielding results that are both accurate and temporally coherent.

MotionBERT: Transformer-Based Motion Representations

The latest generation of video-based pose estimators employs transformers, which have shown great success in sequence modeling tasks across vision and NLP. MotionBERT by Zhu et al. (ICCV 2023) is a representative modern approach that offers a unified perspective on learning human motion representations.

MotionBERT is a bit different from previous methods in that it aims to pretrain a motion model that can be adapted to multiple tasks: 3D pose estimation from video, skeleton-based action recognition, etc. However, it also directly contributes to SMPL fitting by providing a powerful backbone for temporal pose estimation.

Architecture

MotionBERT uses a transformer-based encoder (specifically, a dual-stream transformer called DSTFormer in the paper) to process spatio-temporal pose data. It decouples spatial and temporal attention:

  • One stream handles relationships between different joints within a single frame (the spatial configuration)

  • Another stream handles the temporal relationships of each joint over time

By alternating or combining these, the model captures complex motion patterns. This is in contrast to a GRU which might have trouble with long sequences or capturing global patterns (like periodic motion across many frames).
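
The decoupling can be sketched with two standard attention layers, one applied over joints within each frame and one over time for each joint (a simplified block, not the exact DSTFormer, which also fuses the two streams adaptively):

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Alternate self-attention across joints (spatial) and across frames (temporal)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, J, C) per-joint tokens
        B, T, J, C = x.shape
        s = x.reshape(B * T, J, C)               # joints within one frame attend to each other
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(B, T, J, C)

        t = x.permute(0, 2, 1, 3).reshape(B * J, T, C)   # each joint attends over time
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(B, J, T, C).permute(0, 2, 1, 3)
        return x
```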

Pretraining on Heterogeneous Data

One of MotionBERT’s innovations is to leverage both motion capture data and large 2D keypoint datasets in a pretraining-finetuning paradigm.

In the pretraining stage, they generate sequences of 2D keypoints (some from real videos, some synthetically “masked” or noised) and train the transformer to lift them to 3D (essentially a 2D-to-3D pose reconstruction task). They corrupt the input in various ways (mask certain frames or joints) so that the model learns to fill in gaps, akin to BERT’s masking strategy but for motion.

This self-supervised or weakly-supervised training teaches the transformer a general notion of human motion continuity, even without explicit SMPL parameter output yet. Then, in fine-tuning, they specialize the model to specific tasks, one of which is 3D human mesh recovery from video: they attach a regression head that outputs SMPL pose and shape, and train on annotated datasets.

Because MotionBERT’s pretraining can use diverse data sources (2D pose sequences from YouTube, motion capture from AMASS, etc.), the resulting model has a strong prior for human motion. Fine-tuned on 3DPW or Human3.6M, it achieved state-of-the-art results. For instance, the paper reports a mean MPJPE of about 39.2 mm on Human3.6M, notably better than previous methods in the high-40s or 50s.

It also shows significant improvement in action recognition accuracy when using the learned representations, underlining the generality of the features. In the context of SMPL fitting, MotionBERT can be viewed as providing a very informed motion prior (even more implicitly than VIBE’s discriminator) – it’s “seen” a lot of motion during pretraining and thus will output pose sequences that both match the image evidence and make sense globally.

Practical Considerations

One difference in MotionBERT is that it often operates on 2D skeleton input rather than raw pixels (the version described in the paper lifts 2D keypoints to 3D). In a complete system, one could combine MotionBERT with an image-based pose detector: first extract 2D keypoints per frame, then use the transformer to produce the 3D pose and shape.

This decoupling can leverage the accuracy of 2D pose estimators (which are very mature) and the learned motion model for 3D, at the expense of not using the image fully (e.g., shape estimation might suffer if only skeletons are used).

However, nothing precludes using image features directly in a transformer for 3D human mesh – some recent works do exactly that, mixing CNN and transformer (a CNN extracts per-frame features, a transformer handles temporal aggregation, similar to how VIBE used a GRU). The trend is that transformers with long-range attention can better model motions that span longer times (like a full walking cycle or a dance move) than RNNs with limited memory.

In summary, MotionBERT pushes the frontier by applying modern sequence learning strategies and large-scale pretraining to human motion. It achieves a unified model that can be fine-tuned for 3D pose, shape, and even action understanding. For the task of human mesh recovery, it means we can get highly accurate and temporally stable predictions, leveraging both rich unlabeled video data and sophisticated architectures.

Comparison of Learning-Based Methods

Having surveyed major learning-based SMPL fitting techniques, we now compare them along key dimensions:

Supervision and Data

Early methods like HMR and NBF used primarily 2D annotated images and a bit of 3D mocap for adversarial training. They demonstrated that in-the-wild images with only 2D keypoints can be used to train a 3D regressor when combined with a strong prior (HMR’s discriminator or NBF’s part segmentation).

SPIN increased the use of mixed supervision: it exploited 2D keypoints from several datasets and 3D labels from controlled datasets, plus SMPLify fits as additional pseudo-labels. This resulted in a significant accuracy jump, showing the value of using all available data (the so-called “dataset fusion” training).

PyMAF and CLIFF continued this trend – for example, CLIFF used pseudo-3D annotations from its own annotator to augment training images without ground truth. Temporal methods similarly mix 2D keypoints from video (often obtained via an off-the-shelf 2D tracker) with datasets like 3DPW that have 3D sequences, and large mocap collections (for adversarial or pretraining).

In summary, modern approaches train on a melting pot of data: COCO, MPII, H36M, 3DHP, 3DPW, AMASS, etc., each providing different supervision signals. This has been crucial to reaching ~40-50 mm MPJPE levels.

Network Architecture

The progression has been from simple encoders (ResNet + MLP) to more complex ones with feedback or multi-stream designs:

  • HMR: Relatively simple (ResNet + IEF loop + discriminator)

  • NBF: Two-stage (segmentation + regression) modular design

  • SPIN: Simple architecture (ResNet + MLP head) but changed the training loop

  • PyMAF: Internal loop with a feature pyramid and attention to spatial detail

  • CLIFF: Extra inputs (full-frame info) but otherwise similar to HMR/SPIN

  • PIXIE: Multiple networks with a moderator

  • Video methods: Added RNNs (VIBE’s GRU, TCMR’s bi-GRU) or Transformers (MotionBERT)

Notably, as architectures evolved, the number of parameters increased modestly but not drastically – even MotionBERT remains in the range of a few tens of millions of parameters, which is feasible. It is the training data and training strategy that often make the bigger difference.

One can view many of these methods as different network topologies exploring the design space: e.g., iterative vs. direct regression, single vs. multi-head (NBF could be seen as a multi-head network: one head for segmentation, one for pose).

Objective Functions

All methods rely on reprojection loss on keypoints as the primary driver when ground truth 3D is scarce. But they differ in auxiliary losses:

  • HMR: Adversarial pose prior loss

  • NBF: Segmentation network has its cross-entropy loss; 3D parameter loss when possible

  • SPIN: Runs SMPLify (minimizing its reprojection energy) in the training loop to provide pseudo-ground-truth labels

  • PyMAF: Alignment loss (possibly silhouette or correspondence) to guide the pixel-aligned features

  • CLIFF: Same losses as SPIN (2D/3D keypoints), but computed in a different coordinate frame

  • PIXIE: Multi-part losses – face landmarks, body joints, hand joints, gender classification, and shape regularizer

  • VIBE: Per-frame keypoint losses + adversarial sequence loss

  • TCMR: Per-frame losses, but implicitly optimizes sequence smoothness through architecture

  • MotionBERT: Pretraining uses masked reconstruction loss on motion, fine-tuning uses joint losses

Pose and Shape Priors

Methods differ in how they ensure plausible outputs:

  • Optimization methods: Used explicit priors (like SMPLify’s pose prior GMM)

  • HMR: Replaced that with a learned adversary

  • NBF and others: Methods without an adversary sometimes include a mild \(\ell_2\) prior on shape (to keep \(\boldsymbol{\beta}\) near zero)

  • SPIN: Effectively used the prior baked into SMPLify’s results

  • PyMAF and CLIFF: Rely on training with SPIN/HMR outputs and datasets

  • PIXIE: Used gendered shape and the fact that face and body networks were trained on real data

  • VIBE: Motion prior is the adversarial discriminator on sequences

  • TCMR: Prior is its forecasting mechanism (trained on real sequences)

  • MotionBERT: Prior comes from pretraining on massive motion data and implicitly learning physics

Performance

It is difficult to cite exact numbers for all methods on all benchmarks (they are often evaluated on different sets), but a general trend on the popular 3DPW dataset (in-the-wild video, evaluated with MPJPE in mm) is as follows:

  • SMPLify (optimization): ~100+ mm

  • HMR (2018): ~81 mm

  • NBF (2018): Similar to HMR (not reported in the same way, but competitive)

  • SPIN (2019): ~59 mm

  • GraphCMR (2019): ~70 mm

  • PyMAF (2021): ~58 mm (slight improvement over SPIN)

  • CLIFF (2022): ~52 mm (significant jump by solving global orientation)

  • PIXIE (2021): On body metrics similar to PyMAF

  • VIBE (2020): ~51 mm on 3DPW

  • TCMR (2021): ~46-47 mm on 3DPW

  • MotionBERT (2023): ~45 mm or lower on 3DPW

These numbers are indicative (and often quoted after Procrustes alignment, etc., so one must be careful). The bottom line is that accuracy (in terms of joint error) has roughly halved from 2018 to 2023, thanks to learning-based approaches.

Qualitatively, the field has gone from wobbly, coarse reconstructions to quite realistic and stable reconstructions of human motion in everyday videos.

Runtime

Learning methods are generally fast. An HMR or SPIN model runs at ~30-60 FPS on a GPU (ResNet-50 based). VIBE, with a GRU over a 16-frame batch, also achieves near real-time performance (on the order of 10-20 FPS).

TCMR and MotionBERT, if run non-causally on full sequences, are more offline and could be slower, but they can be chunked. In contrast, optimization like SMPLify took several seconds per image. Thus, learning-based methods enable applications like live motion capture from a webcam or processing large video datasets for analytics.

Strengths and Weaknesses

Regression methods (HMR, SPIN, etc.):

  • Strengths: Very fast; can leverage big data to avoid local minima; straightforward to integrate with modern frameworks

  • Weaknesses: May fail for poses unseen in training; can be thrown off by occlusions or unseen camera viewpoints; without careful priors can output odd poses

Hybrid methods (SPIN, EFT):

  • Strengths: Get an accuracy boost from optimization; can train with less 3D ground truth

  • Weaknesses: More complex training; still require good 2D detections

Feedback methods (PyMAF):

  • Strengths: Better alignment; address some shortcomings of one-shot regression

  • Weaknesses: Slightly more computation; still ultimately limited by the training data distribution

Full-frame methods (CLIFF):

  • Strengths: Resolve global orientation and position, needed for multi-person or camera-aware scenarios

  • Weaknesses: Require known intrinsics or at least consistent bounding box references; multi-person extension requires detecting and tracking multiple people

Whole-body (PIXIE):

  • Strengths: Combines the best of specialized methods; achieves detailed face/hand reconstruction

  • Weaknesses: More network components; needs multi-domain data; might not generalize if one part is in extreme conditions

Temporal (VIBE, TCMR):

  • Strengths: Smooth, physically plausible output; can handle momentary occlusions or drops in keypoint detection by using temporal context

  • Weaknesses: Often need a buffer of frames (lag or offline processing); might not resolve inherently ambiguous poses

Transformer (MotionBERT):

  • Strengths: Can capture long-term dependencies and be pretrained on huge data, thus very robust

  • Weaknesses: Typically requires a lot of training data and computing power; might rely on external 2D detections that have their own errors (if not end-to-end)

Conclusion

Learning-based methods for fitting SMPL to images have dramatically advanced the state of the art. By combining deep networks with model knowledge (the SMPL parametric space), they achieve fast and accurate 3D human pose and shape estimation.

The field has progressed from basic CNN regressors with weak priors to sophisticated systems that integrate optimization, multi-scale reasoning, and temporal modeling. These methods form the foundation for downstream tasks in vision (human action recognition, AR/VR avatar creation, biomechanical analysis, etc.), and many are available as open-source projects.

In future lectures, we will explore how these pose/shape estimators can be extended to handle clothing, hair, and interaction with objects, moving closer to capturing the full complexity of humans in the real world.