Lecture 01.3 – Introduction to Human Models (Overview)

Lecture Slides: Introduction to Human Models

This lecture presents a comprehensive overview of human body modeling, from historical roots to state-of-the-art techniques. We explore how knowledge from anatomy, computer vision, computer graphics, and biomechanics converges to create digital representations of human shape, motion, and behavior.

1. Historical Context

Human body modeling has evolved through centuries of scientific investigation:

Early Scientific Studies

  • Weber Brothers (1836): Conducted one of the first quantitative gait analyses, measuring timing and distances in human walking.

  • Marey and Muybridge (1870s-1880s): Pioneered sequential photography (chronophotography) to capture and analyze human motion.

  • Braune and Fischer (1890s): Applied Newtonian mechanics to study body-segment motion, calculating joint forces and energy expenditure during locomotion.

Mid-20th Century to Digital Era

  • Biomechanical Research: Rehabilitation needs for World War II veterans spurred comprehensive gait studies at the University of California in the 1950s.

  • Computer Graphics (1970s-1980s):
      - Phong’s illumination model (1975) improved rendering of 3D surfaces
      - Fred Parke (1972) created the first 3D facial models

  • Motion Capture Development:
      - Tom Calvert’s goniometer suit (1983) for medical motion capture
      - Marker-based optical systems emerged in the late 1980s
      - Vicon systems with reflective markers became standard in the 1990s

21st Century Advances

  • Markerless Motion Capture:
      - Hogg’s early work (1983) had already demonstrated tracking walking figures from video
      - Multi-camera systems in the 2000s enabled visual hull reconstruction
      - Depth sensors (Microsoft Kinect, 2010) accelerated markerless capture

  • Deep Learning Revolution:
      - Convolutional networks for 2D and 3D pose estimation (OpenPose, DeepPose)
      - Parametric body models like SMPL enabled single-image 3D reconstruction

  • Behavior Synthesis:
      - From keyframe animation and physical simulations (1980s-1990s)
      - To motion graphs for recombining captured clips (2000s)
      - To modern deep learning approaches for generating realistic movements

Today’s human body models combine anatomical insight, physics, and data-driven learning to achieve unprecedented realism and functionality.

2. Mathematical Foundations

Parametric Body Models

The Skinned Multi-Person Linear (SMPL) model exemplifies modern parametric approaches:

\[M(\boldsymbol{\theta}, \boldsymbol{\beta}) : \mathbb{R}^{|\theta| + |\beta|} \rightarrow \mathbb{R}^{3N}\]

where:

  • \(\boldsymbol{\theta}\) represents pose parameters (72 values in SMPL: one 3-parameter axis-angle rotation for each of 24 joints, counting the root, whose rotation gives the global orientation)

  • \(\boldsymbol{\beta}\) represents shape parameters (typically the first 10 principal components)

  • \(N\) is the number of mesh vertices (6890 in SMPL)

SMPL can be factored into:

  1. Base mesh (mean shape)

  2. Shape blend shapes (scaled by \(\boldsymbol{\beta}\))

  3. Pose blend shapes (dependent on \(\boldsymbol{\theta}\))

  4. Skeleton-driven deformation via linear blend skinning

This creates a differentiable, low-dimensional representation that can be efficiently optimized.
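The shape stage of such a model can be sketched in a few lines. The following is a toy illustration, not the SMPL implementation: the three-vertex "mesh", the two blend shapes, and the function name `apply_shape` are all made up for the example. It only shows that shape blend shapes are a linear combination of per-vertex offsets added to the template.

```python
# Toy sketch of the shape stage: template + linear shape blend shapes.
# All numbers are invented for illustration; real SMPL has 6890 vertices.

def apply_shape(template, shape_basis, betas):
    """Return template vertices displaced by a linear combination of offsets.

    template:    list of N vertices, each [x, y, z]
    shape_basis: list of K blend shapes, each a list of N offsets [dx, dy, dz]
    betas:       list of K shape coefficients
    """
    shaped = [list(v) for v in template]  # copy so the template is untouched
    for beta, blend_shape in zip(betas, shape_basis):
        for vertex, offset in zip(shaped, blend_shape):
            for axis in range(3):
                vertex[axis] += beta * offset[axis]
    return shaped

# Three-vertex "mesh" with two hypothetical shape directions.
template = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
shape_basis = [
    [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.05, 0.0]],  # hypothetical "taller"
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],   # hypothetical "wider"
]
shaped = apply_shape(template, shape_basis, betas=[2.0, -1.0])
```

Because every operation is an addition or a multiplication, the output is differentiable with respect to `betas`, which is the property that makes gradient-based fitting possible.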

Implicit Surface Representations

As an alternative to meshes, implicit functions define the body surface as a level set:

\[\text{Surface} = \{\mathbf{x} \in \mathbb{R}^3 : f(\mathbf{x}) = 0\}\]

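Common implicit representations include:

  • Signed Distance Functions (SDFs): \(f(\mathbf{x})\) gives the signed distance to the nearest surface point (positive outside, negative inside)

  • Occupancy Functions: Binary inside/outside classification; the surface is the decision boundary, commonly the 0.5 level set of a predicted occupancy probability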

Neural networks can approximate these functions:

  • DeepSDF: MLPs outputting distance values for query points

  • Neural Articulated Shape Approximation (NASA): Implicit functions conditioned on pose
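To make the two representations concrete, here is a minimal sketch that uses an analytic sphere in place of a learned network. The function names are illustrative, and a sphere is chosen only because its signed distance has a closed form.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: positive outside, negative inside."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    return math.sqrt(dx * dx + dy * dy + dz * dz) - radius

def occupancy(point, center, radius):
    """Binary inside/outside label derived from the sign of the SDF."""
    return 1 if sphere_sdf(point, center, radius) < 0.0 else 0

def project_to_surface(point, center, radius):
    """Move a point onto the zero level set along the SDF gradient.

    For a sphere the gradient is the unit vector from the center to the
    point, so a single step of length f(x) lands exactly on the surface.
    """
    offset = [p - c for p, c in zip(point, center)]
    norm = math.sqrt(sum(o * o for o in offset))
    if norm == 0.0:
        raise ValueError("gradient is undefined at the sphere center")
    distance = norm - radius
    return tuple(p - distance * o / norm for p, o in zip(point, offset))

origin = (0.0, 0.0, 0.0)
outside = sphere_sdf((2.0, 0.0, 0.0), origin, 1.0)
inside = sphere_sdf((0.0, 0.0, 0.0), origin, 1.0)
on_surface = project_to_surface((2.0, 0.0, 0.0), origin, 1.0)
```

A neural implicit model replaces `sphere_sdf` with a network evaluated at the query point; the surrounding logic (sign test, stepping along the gradient) stays the same in spirit.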

Kinematic Modeling

Human movement is modeled as an articulated figure:

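  • Forward Kinematics (FK): Computing limb positions from joint angles
      - Global transform of joint \(j\): \(G_j = G_{\text{parent}(j)} \cdot \text{Trans}(L_{\text{parent}(j)}) \cdot R_j(\theta_j)\), where \(\text{Trans}(L_{\text{parent}(j)})\) is the fixed translation along the parent bone to joint \(j\) and \(R_j(\theta_j)\) is the local rotation of joint \(j\)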

  • Inverse Kinematics (IK): Solving for joint angles given desired end-effector positions
      - Often uses the Jacobian \(J(\boldsymbol{\theta}) = \frac{\partial \mathbf{p}}{\partial \boldsymbol{\theta}}\), which relates joint angle changes to end-effector position changes

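  • Skinning: With linear blend skinning, the posed position \(v_i'\) of rest-pose vertex \(v_i\) is \(v_i' = \sum_j w_{ij} (\mathbf{T}_j(\theta) \cdot v_i)\), where \(\mathbf{T}_j(\theta)\) is the transform of joint \(j\) relative to the rest pose and \(w_{ij}\) are skinning weights that sum to one over \(j\)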
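The forward-kinematics recursion and the skinning sum can be illustrated on a planar two-joint chain. This is a simplified 2D sketch under stated assumptions: joints rotate about a single axis, each frame is stored as `(x, y, angle)` instead of a matrix, and `bone_lengths[j - 1]` is taken to be the distance from joint `j - 1` to joint `j`. The function names are illustrative.

```python
import math

def forward_kinematics(bone_lengths, joint_angles):
    """Return the global frame (x, y, angle) of every joint in a planar chain."""
    frames = []
    x, y, angle = 0.0, 0.0, 0.0
    for j, theta in enumerate(joint_angles):
        if j > 0:
            # Trans(L_parent): step along the parent's local x axis.
            x += bone_lengths[j - 1] * math.cos(angle)
            y += bone_lengths[j - 1] * math.sin(angle)
        angle += theta  # R_j(theta_j): local rotation composed with the parent
        frames.append((x, y, angle))
    return frames

def skin_vertex(vertex, weights, rest_frames, posed_frames):
    """Linear blend skinning: weighted sum of per-joint rigid transforms."""
    out_x = out_y = 0.0
    for w, (rx, ry, ra), (px, py, pa) in zip(weights, rest_frames, posed_frames):
        # T_j = G_j(posed) * G_j(rest)^-1: express the vertex in the rest
        # frame of joint j, then map it through the posed frame.
        lx, ly = vertex[0] - rx, vertex[1] - ry
        c, s = math.cos(pa - ra), math.sin(pa - ra)
        out_x += w * (px + c * lx - s * ly)
        out_y += w * (py + s * lx + c * ly)
    return out_x, out_y

bone_lengths = [1.0]
rest_frames = forward_kinematics(bone_lengths, [0.0, 0.0])
posed_frames = forward_kinematics(bone_lengths, [0.0, math.pi / 2])  # bend the elbow

rigid = skin_vertex((1.5, 0.0), [0.0, 1.0], rest_frames, posed_frames)
blended = skin_vertex((1.0, 0.1), [0.5, 0.5], rest_frames, posed_frames)
```

The vertex bound entirely to the second joint rotates rigidly with it, while the vertex near the elbow, with equal weights, ends up halfway between the two rigid predictions. That averaging is what gives linear blend skinning its smooth bends and also its known volume-loss artifacts.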

For pose and shape estimation, optimization seeks parameters that minimize the distance between model and observations, often using iterative methods or learning-based approaches.
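As a small instance of such an iterative method, the sketch below solves inverse kinematics for a planar two-link arm by gradient descent on the squared end-effector error, using the Jacobian transpose. The arm, the step size, and the iteration count are assumptions chosen for illustration; practical solvers often use damped least squares or other schemes instead.

```python
import math

def end_effector(theta, lengths):
    """Position of the tip of a planar two-link arm."""
    (t1, t2), (l1, l2) = theta, lengths
    return (l1 * math.cos(t1) + l2 * math.cos(t1 + t2),
            l1 * math.sin(t1) + l2 * math.sin(t1 + t2))

def jacobian(theta, lengths):
    """2x2 matrix of partial derivatives d(position) / d(theta)."""
    (t1, t2), (l1, l2) = theta, lengths
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def solve_ik(target, theta, lengths, step=0.1, iterations=1000):
    """Minimize 0.5 * ||p(theta) - target||^2 by gradient descent."""
    theta = list(theta)
    for _ in range(iterations):
        px, py = end_effector(theta, lengths)
        ex, ey = px - target[0], py - target[1]
        J = jacobian(theta, lengths)
        # The gradient of the objective is J^T e.
        theta[0] -= step * (J[0][0] * ex + J[1][0] * ey)
        theta[1] -= step * (J[0][1] * ex + J[1][1] * ey)
    return theta

lengths = (1.0, 1.0)
target = (1.0, 1.0)
theta = solve_ik(target, (0.3, 0.3), lengths)
reached = end_effector(theta, lengths)
```

Fitting a body model to observations follows the same pattern with more parameters: evaluate the model, measure the residual, and update the parameters along the negative gradient.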

3. Image Formation and Rendering

Projecting 3D humans to 2D images involves several processes:

Camera Models

The pinhole camera model provides the foundation:

\[(u, v) = \left(f \frac{X}{Z} + c_x, f \frac{Y}{Z} + c_y\right)\]

where:

  • \((X, Y, Z)\) are 3D coordinates in camera space, with \(Z > 0\) in front of the camera

  • \(f\) is the focal length, in pixels

  • \((c_x, c_y)\) is the principal point

Camera extrinsic parameters (rotation \(R\), translation \(t\)) transform world coordinates to camera coordinates before projection.
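The two steps, extrinsic transform followed by perspective projection, can be written out directly. This is a minimal sketch with illustrative function names and made-up camera values; it assumes a single focal length for both axes and no lens distortion, matching the formula above.

```python
def world_to_camera(point, R, t):
    """Apply extrinsics: X_cam = R * X_world + t (R is a 3x3 nested list)."""
    return tuple(sum(R[i][k] * point[k] for k in range(3)) + t[i]
                 for i in range(3))

def project_pinhole(point_cam, f, cx, cy):
    """Perspective projection of a camera-space point to pixel coordinates."""
    X, Y, Z = point_cam
    if Z <= 0.0:
        raise ValueError("point is behind the camera or on the image plane")
    return (f * X / Z + cx, f * Y / Z + cy)

# Made-up example: identity rotation, camera 5 units in front of the origin.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 5.0)
point_cam = world_to_camera((1.0, -0.5, 0.0), R, t)
u, v = project_pinhole(point_cam, f=1000.0, cx=320.0, cy=240.0)
```

With these values the point maps to camera coordinates (1, -0.5, 5), so the projection is 1000 * 1/5 + 320 = 520 horizontally and 1000 * (-0.5)/5 + 240 = 140 vertically.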

Shading and Visibility

  • Lambertian shading: Diffuse surface brightness is \(I = \rho \, \max(0, \mathbf{n} \cdot \mathbf{l})\), where \(\rho\) is the surface albedo, \(\mathbf{n}\) is the unit surface normal, and \(\mathbf{l}\) is the unit direction toward the light

  • Phong model: Adds specular highlights for more realistic rendering

  • Z-buffer: Resolves visibility by keeping only the nearest surface at each pixel

  • Silhouettes: In multi-view setups, combining silhouettes creates visual hulls approximating the 3D volume of a person
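Two of the items above, Lambertian shading and the z-buffer, are simple enough to sketch directly. The code below is illustrative only: fragments are given as precomputed `(px, py, depth, color)` tuples rather than produced by rasterizing triangles, and the function names are made up for the example.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def lambert(albedo, normal, light_dir):
    """Diffuse intensity: albedo * max(0, n . l) with unit vectors."""
    n, l = normalize(normal), normalize(light_dir)
    return albedo * max(0.0, sum(a * b for a, b in zip(n, l)))

def zbuffer(fragments, width, height):
    """Keep, per pixel, the color of the fragment with the smallest depth."""
    depth = [[math.inf] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for px, py, z, c in fragments:
        if 0 <= px < width and 0 <= py < height and z < depth[py][px]:
            depth[py][px] = z
            color[py][px] = c
    return color, depth

facing = lambert(0.8, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
tilted = lambert(0.8, (0.0, 0.0, 1.0), (0.0, math.sin(math.pi / 3), 0.5))
away = lambert(0.8, (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))

fragments = [(0, 0, 5.0, "far"), (0, 0, 2.0, "near"), (1, 0, 3.0, "only")]
color, depth = zbuffer(fragments, width=2, height=1)
```

A surface facing the light receives the full albedo, one tilted by 60 degrees receives half of it, and one facing away receives none. In the z-buffer example the two fragments at pixel (0, 0) compete and the nearer one wins regardless of submission order.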

Differentiable Rendering

Recent advances make the rendering process differentiable, enabling gradient-based optimization:

  • Softened rasterization: Allows gradients to flow even through discrete operations

  • End-to-end optimization: Neural networks can be trained to predict body parameters by comparing rendered projections with input images

  • Self-supervised learning: Using image synthesis error as a loss when 3D ground truth is unavailable

This capability allows fitting 3D human models to 2D observations by iteratively refining the model to align with the input image.
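Why softening matters can be seen in a one-dimensional toy problem. In the sketch below the "body" is an interval projected onto a row of 13 pixels, the hard silhouette is a step function, and the soft silhouette is a sigmoid of the signed distance to the interval edge. Everything here is an assumption made for illustration: the 1D setup, the constants, and the use of finite differences as a stand-in for automatic differentiation.

```python
import math

PIXELS = range(13)   # pixel centers at x = 0, 1, ..., 12
HALF_WIDTH = 2.0     # half-width of the projected interval

def hard_coverage(x, center):
    """Binary silhouette: 1 inside the interval, 0 outside."""
    return 1.0 if abs(x - center) <= HALF_WIDTH else 0.0

def soft_coverage(x, center, sigma=0.5):
    """Sigmoid of the signed distance to the interval edge."""
    return 1.0 / (1.0 + math.exp(-(HALF_WIDTH - abs(x - center)) / sigma))

def silhouette_loss(coverage, center, target_center):
    """Sum of squared differences against a hard target silhouette."""
    return sum((coverage(x, center) - hard_coverage(x, target_center)) ** 2
               for x in PIXELS)

def numeric_grad(coverage, center, target_center, eps=1e-4):
    """Central finite difference of the loss with respect to the center."""
    upper = silhouette_loss(coverage, center + eps, target_center)
    lower = silhouette_loss(coverage, center - eps, target_center)
    return (upper - lower) / (2.0 * eps)

target_center = 6.0
hard_grad = numeric_grad(hard_coverage, 4.3, target_center)
soft_grad = numeric_grad(soft_coverage, 4.3, target_center)

# Gradient descent on the softened loss moves the interval onto the target.
center = 4.3
for _ in range(200):
    center -= 0.1 * numeric_grad(soft_coverage, center, target_center)
```

The hard silhouette is piecewise constant in the center position, so its gradient is zero almost everywhere and gives the optimizer nothing to follow. The softened version has a nonzero gradient pointing toward the target, and descending it brings the center close to 6.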

4. Surface Representation Methods

Two dominant approaches represent human body geometry:

Explicit Mesh Models

  • Fixed topology: Surface represented by vertices connected in a consistent mesh structure (e.g., SMPL with 6890 vertices and 13,776 triangular faces)

  • Blendshapes: Shape variations expressed as vertex displacements from a template mesh
      - SMPL uses linear combinations of learned shape basis vectors

  • Advantages:
      - Efficient rendering on graphics hardware
      - Direct semantic correspondence across shapes
      - Simple animation via skinning
      - Easy texture mapping and collision detection

  • Limitations:
      - Cannot handle topology changes
      - Fixed resolution (more detail requires more vertices)

Implicit Function Models

  • Continuous field: Body defined as a level set of a function in 3D space
      - Neural networks can approximate these fields (e.g., DeepSDF, NASA)

  • Advantages:
      - Topological flexibility (can represent open jackets, loose clothing)
      - Arbitrary resolution (can be sampled at any density)
      - Natural handling of complex geometry
      - Continuous surfaces and gradients

  • Limitations:
      - Computationally expensive to render
      - Harder to animate in real time
      - Less direct control for artists

Hybrid approaches combine explicit models for coarse structure with implicit functions for high-resolution details.

5. Motion Capture and Behavior Synthesis

Capturing Human Motion

Marker-Based and Wearable-Sensor Systems:

  • Optical motion capture: Reflective markers tracked by infrared cameras

  • Inertial systems: IMUs measuring orientation and acceleration on each limb (body-worn sensors rather than optical markers)

  • Advantages: High accuracy and high temporal resolution

  • Limitations: Requires specialized equipment; markers and sensors can interfere with natural movement

Markerless Approaches:

  • Multi-camera systems: Reconstruct visual hulls from silhouettes

  • Deep learning: Models like OpenPose detect 2D keypoints from regular video

  • Model-fitting: SMPLify optimizes a 3D body model to match 2D detections

  • End-to-end networks: HMR and VIBE directly regress SMPL parameters from images or video

Sparse Sensing:

  • Recent work shows as few as 5 IMUs can reconstruct full body pose

  • Learning fills gaps in sparse observations using motion priors

Behavior Synthesis

Motion Graphs and Clip-Based Methods:

  • Stitch existing motion clips at compatible transitions

  • Introduced by Kovar et al. (2002)

  • Good for interactive control with available motion data
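The core operation, finding frames where one clip can hand over to another, can be sketched as follows. This is a deliberate simplification and not the metric from the original paper: poses are plain lists of joint angles, similarity is a squared difference on single frames, and the clips, threshold, and function names are invented for the example.

```python
def pose_distance(pose_a, pose_b):
    """Sum of squared joint-angle differences between two poses."""
    return sum((a - b) ** 2 for a, b in zip(pose_a, pose_b))

def find_transitions(clip_a, clip_b, threshold):
    """Return (i, j) pairs where frame i of clip_a resembles frame j of clip_b."""
    transitions = []
    for i, pose_a in enumerate(clip_a):
        for j, pose_b in enumerate(clip_b):
            if pose_distance(pose_a, pose_b) < threshold:
                transitions.append((i, j))
    return transitions

def stitch(clip_a, clip_b, transition):
    """Play clip_a up to frame i, then continue clip_b after frame j."""
    i, j = transition
    return clip_a[: i + 1] + clip_b[j + 1 :]

# Two made-up clips with two joint angles per frame.
clip_a = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4]]
clip_b = [[1.0, 1.0], [0.21, 0.38], [0.5, 0.5]]

transitions = find_transitions(clip_a, clip_b, threshold=0.01)
combined = stitch(clip_a, clip_b, transitions[0])
```

Each detected pair becomes an edge in the motion graph, and new motion is generated by walking the graph. A production system would also compare velocities over a window of frames and blend across the seam to avoid visible pops.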

Physics-Based Simulation:

  • Model the body as articulated rigid bodies with physics

  • Apply joint torques to generate movement

  • Examples include Hodgins et al. (1995) simulating athletic movements
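A common way to turn a desired pose into joint torques is proportional-derivative (PD) control. The sketch below drives a single hinge joint toward a target angle. It is an illustration under simplifying assumptions: one degree of freedom, unit inertia, no gravity or contact, and gains picked by hand to be critically damped. It is not taken from any of the cited systems.

```python
def simulate_pd_joint(target_angle, kp=100.0, kd=20.0, inertia=1.0,
                      dt=0.01, steps=500):
    """Integrate one hinge joint driven by a PD torque; return the angles."""
    angle, velocity = 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        # Torque pulls toward the target and damps the current velocity.
        torque = kp * (target_angle - angle) - kd * velocity
        # Semi-implicit Euler: update velocity first, then position.
        velocity += dt * torque / inertia
        angle += dt * velocity
        trajectory.append(angle)
    return trajectory

trajectory = simulate_pd_joint(target_angle=1.0)
final_angle = trajectory[-1]
```

In a full character, one such controller runs per joint while a rigid-body simulator resolves gravity, contacts, and the coupling between limbs. Choosing the target angles over time is then the job of a hand-designed state machine or a learned policy.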

Deep Learning Approaches:

  • Generative models: VAEs, GANs, and diffusion models learn motion distributions

  • Can be conditioned on music, action labels, or other high-level inputs

  • Example: DeepMimic (Peng et al. 2018) uses reinforcement learning to imitate mocap clips

Hybrid Methods:

  • Combine data-driven motion with physics constraints

  • Xie et al. (2021) incorporate physics into training from video data

  • Ensure plausible dynamics while leveraging large datasets

6. Clothing Modeling

Realistic virtual humans require clothing that moves naturally:

Physically-Based Simulation

  • Mass-spring systems: Cloth as mesh with physical forces

  • Finite element methods: More accurate but computationally expensive

  • Baraff & Witkin (1998): Pioneered efficient implicit integration for cloth

The net force on each cloth particle is the sum of:

\[\mathbf{F} = \mathbf{F}_{\text{elastic}} + \mathbf{F}_{\text{gravity}} + \mathbf{F}_{\text{collision}}\]

  • Advantages: Realistic dynamics for any movement

  • Limitations: Computationally intensive, requires accurate material parameters
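A mass-spring system reduces to a short update loop. The sketch below is a minimal 2D version with one pinned particle and one hanging particle. Note the assumptions: it uses semi-implicit (symplectic) Euler rather than the implicit integration of Baraff & Witkin, it has no collision handling, and the material constants and function name are made up for illustration.

```python
import math

def step(positions, velocities, pinned, springs, mass, stiffness,
         damping, gravity, dt):
    """Advance all free particles by one time step, in place."""
    # Start from gravity, then accumulate spring forces.
    forces = [[0.0, -mass * gravity] for _ in positions]
    for i, j, rest_length in springs:
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # direction undefined; skip this spring for the step
        magnitude = stiffness * (length - rest_length)  # > 0 when stretched
        fx, fy = magnitude * dx / length, magnitude * dy / length
        forces[i][0] += fx   # a stretched spring pulls i toward j
        forces[i][1] += fy
        forces[j][0] -= fx   # and j toward i
        forces[j][1] -= fy
    for idx in range(len(positions)):
        if idx in pinned:
            continue
        for axis in range(2):
            forces[idx][axis] -= damping * velocities[idx][axis]
            velocities[idx][axis] += dt * forces[idx][axis] / mass
            positions[idx][axis] += dt * velocities[idx][axis]

# One particle pinned at the origin, one hanging a rest length below it.
positions = [[0.0, 0.0], [0.0, -1.0]]
velocities = [[0.0, 0.0], [0.0, 0.0]]
springs = [(0, 1, 1.0)]
for _ in range(2000):
    step(positions, velocities, pinned={0}, springs=springs, mass=1.0,
         stiffness=100.0, damping=2.0, gravity=9.81, dt=0.01)
hanging_y = positions[1][1]
```

The particle settles where the spring force balances gravity, at an extension of mass * gravity / stiffness = 0.0981 below the rest length. A cloth mesh uses the same loop with many particles and structural, shear, and bending springs; stiff fabrics are what make explicit schemes like this one require very small time steps, which is the motivation for implicit integration.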

Data-Driven Approaches

  • Garment shape spaces: Learn how clothing deforms with different poses

  • TailorNet: Neural network predicting clothing deformation from body pose and shape

  • Displacement models: Map offsets from body surface to clothing

  • Advantages: Fast runtime performance after training

  • Limitations: Limited to training distribution of poses/shapes

Implicit Clothing Models

  • Neural implicit functions: Represent clothing as level sets

  • BCNet: Two-layer model with body and cloth as separate implicit surfaces

  • Advantages: Handle topology changes (open jackets, loose garments)

  • Limitations: More complex to train and render

Layered approaches combine body models with separate clothing models, enabling transfer between different bodies while maintaining natural movement.

7. Human-Object Interaction

Modeling interactions between humans and their environment:

Physics-Based Methods

  • Contact constraints: Ensure no penetration, appropriate reaction forces

  • Motion planning: Find trajectories that accomplish tasks while obeying physics

  • Contact-Invariant Optimization: Mordatch et al. (2012) optimized motion with contact variables

  • Applications: Sitting, climbing, manipulating objects with proper physics
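The simplest contact constraint is non-penetration against a ground plane. The sketch below resolves it by projection, as a position-level fix applied after an integration step. It assumes a y-up world, a flat ground, and no friction; the function name and the restitution parameter are illustrative, and full simulators instead solve for contact forces jointly with the dynamics.

```python
def resolve_ground_contact(position, velocity, ground_height=0.0,
                           restitution=0.0):
    """Push a penetrating point back to the ground and fix its velocity."""
    x, y, z = position
    vx, vy, vz = velocity
    if y < ground_height:
        y = ground_height             # remove the penetration
        if vy < 0.0:
            vy = -restitution * vy    # cancel or reflect the inward velocity
    return (x, y, z), (vx, vy, vz)

# A foot point that sank 2 cm into the floor while moving down and forward.
foot_pos, foot_vel = resolve_ground_contact((0.3, -0.02, 0.1), (1.0, -0.5, 0.0))

# A point above the ground is left untouched.
free_pos, free_vel = resolve_ground_contact((0.0, 0.5, 0.0), (0.0, -1.0, 0.0))
```

With zero restitution the normal velocity is removed entirely, which models a foot planting on the floor; tangential motion is preserved because friction is not modeled here.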

Learning-Based Approaches

  • Affordances: Learn which objects allow which actions (chairs afford sitting)

  • PROX: Hassan et al. (2019) captured realistic human-scene interactions

  • Pose prediction: Generate appropriate human poses near specific objects

  • Applications: Scene population, interaction prediction, ergonomic assessment

Hybrid Systems

  • Reinforcement learning for tasks: Learning to Sit (ICLR 2020) used neural policies for chair interactions

  • COUCH (2021): Combined data-driven pose synthesis with controllable contact points

  • Applications: Interactive virtual humans that respond naturally to environments

Human-object interaction modeling is crucial for virtual reality, robotics, and digital human simulations that involve realistic environmental interaction.

8. Applications

Virtual human models power applications across numerous domains:

Entertainment and Media

  • Film and Animation: Digital characters and crowds in movies

  • Video Games: Real-time character control and procedural animation

  • Virtual Reality: Avatars representing users in immersive environments

Healthcare and Biomechanics

  • Gait Analysis: Quantify walking patterns for diagnosis and treatment

  • Rehabilitation: Track and assess patient movements during therapy

  • Surgical Planning: Patient-specific anatomical models

  • Sports Performance: Technique analysis and injury prevention

Engineering and Design

  • Ergonomics: Design workspaces and products for human comfort

  • Robotics: Human-robot interaction and collaborative environments

  • Autonomous Systems: Pedestrian tracking and behavior prediction

Human-Computer Interaction

  • Gesture Recognition: Body-based input for interfaces

  • Virtual Try-On: Visualize clothing on personalized avatars

  • Accessibility: Design interfaces for diverse body types and abilities

Scientific Research

  • Psychology: Study body language and non-verbal communication

  • Anthropology: Analyze human movement across cultures

  • Forensics: Reconstruct accidents or crime scenes

9. Challenges and Future Directions

Despite significant progress, several challenges remain:

Scalability and Generalization

  • Population Diversity: Current models often lack coverage of children, elderly, or unusual body types

  • Motion Diversity: Rare or extreme actions may fall outside training distributions

  • Computational Efficiency: High-fidelity models require significant resources

Higher-Fidelity Dynamics

  • Soft Tissue: Modeling fat and muscle jiggling during movement

  • Fine Details: Realistic facial expressions and hand articulation

  • Secondary Motion: Cloth, hair, and accessories with physical accuracy

Data and Labeling Constraints

  • Ground Truth: Difficult to obtain accurate 3D pose for in-the-wild data

  • Contact Information: Precisely capturing where and how bodies interact with objects

  • Privacy Concerns: Ethical use of motion data that may be identifying

Physics and Learning Integration

  • Physical Plausibility: Learned models can produce physically impossible results

  • Differentiable Physics: Backpropagating through simulations for training

  • Simulation-to-Real Gap: Ensuring models transfer from simulation to real data

Semantic and Cognitive Aspects

  • Action Planning: High-level decision making for autonomous virtual humans

  • Social Behavior: Modeling gestures, personal space, and interaction norms

  • Context Awareness: Understanding environmental constraints and affordances

Realism vs. Controllability

  • Multi-Level Control: Balancing high-level commands with low-level physics

  • Real-Time Performance: Maintaining realism under interactive constraints

  • Artist Tools: Providing intuitive interfaces for animation and control

The future likely holds unified models combining shape, motion, clothing, and intention in a single framework, enabling applications from immersive telepresence to autonomous digital humans that interact naturally with users.