STUDENT TUTORIAL

Animatable Avatar Pipeline

Transform multi-view photographs into animated 3D human avatars with motion transfer.

The Core Problem

Given N images of a person, recover the 3D body model parameters Θ = (β, θ, τ) that explain the observations.

Pipeline: COLMAP → Preprocess → SMPL-X Fit → Texture → Motion Retarget → VAT Viewer

10,475 vertices • 60 fps playback • 83× faster texturing
1 / 14

Pipeline Overview

Each stage transforms data representation, progressively constraining the solution space.

1. COLMAP: Multi-view stereo recovers 3D geometry with vertex colors.
   N × H×W×3 images → ~1.5M vertices + colors

2. Preprocess: Remove artifacts, fix topology, normalize scale.
   ~1.5M noisy vertices → 300-500K clean vertices

3. SMPL-X Fit: Constrain to an anatomically plausible body; gain a skeleton.
   300-500K target points → 10,475 rigged vertices

4. Color Transfer: k-NN maps scan colors to SMPL-X vertices once; the mapping persists.
   COLMAP colors + SMPL-X → 10,475 vertex colors

5. Motion Retarget: Transfer any motion to our body shape; colors follow via LBS.
   SMPL-X + MOYO sequence → T frames × (xyz + rgb)

6. Retarget to COLMAP (optional): Transfer SMPL-X skinning weights to the original mesh; animate full-detail geometry.
   SMPL-X weights + COLMAP → ~1M rigged vertices

7. VAT Export (optional): GPU-friendly format for real-time playback.
   T × 10,475 × (xyz+rgb) → 3 PNGs + JSON

The Dimensionality Story

We start with millions of pixels, reconstruct ~1.5M colored 3D points, then collapse geometry to 119 parameters: β (10 shape), θ (75 pose), hands (24 PCA), ψ (10 expression). SMPL-X acts as a strong prior. Optional: retarget weights back to COLMAP for full-detail animation.

N×H×W×3 pixels → 1.5M pts + rgb → 119 params + rgb → 1M+ retargeted verts (optional)
2 / 14

Stage 1: Image Capture

Video Capture Workflow (Recommended)

📱 1. Record Video: 3 orbits, 3 angles
🎬 2. Extract Frames: FFmpeg (adjust fps; see the sketch below)
📁 3. Feed to COLMAP: ordered sequence
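
A minimal extraction sketch driven from Python (the file names are placeholders; tune the fps filter to your orbit speed):

import subprocess

# ~2 frames/sec of a slow orbit keeps ~70% overlap between neighbors
subprocess.run([
    'ffmpeg', '-i', 'capture.mp4',
    '-vf', 'fps=2',               # extraction rate
    '-qscale:v', '2',             # high JPEG quality for feature matching
    'images/frame_%05d.jpg',
], check=True)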

Coverage: 3 Orbits × 3 Camera Angles

  • 1. Crouching ↑ — capture under chin
  • 2. Standing → — eye level
  • 3. Arms up, aim ↓ — top of head

Walk slowly! Adjacent frames need 70% shared content.

100+ frames minimum • 1080p+ resolution • 3 camera angles

What is "overlap"?

Adjacent frames must share ~70% of visible content. This means walking slowly—if you turn too fast, consecutive frames have no common features and COLMAP can't triangulate.
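
A quick sanity check on walking speed under these assumptions (a 60-second orbit, frames extracted at 2 fps):

orbit_seconds = 60                      # one slow lap around the subject
fps = 2                                 # extraction rate from Stage 1
frames = orbit_seconds * fps            # 120 frames per orbit
deg_per_frame = 360 / frames            # 3.0° of viewpoint change per frame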

Subject Pose: T-Pose
[Figure: T-pose reference]

Why T-Pose?

  • Color coverage: exposes underarms, inner arms, sides
  • Matches SMPL-X default pose
  • Minimizes self-occlusion
  • Best for motion retargeting

Avoid These Mistakes

  • ✗ Subject moving during capture
  • ✗ Arms at sides / hands in pockets
  • ✗ Harsh shadows / mixed lighting
  • ✗ Gaps in coverage
3 / 14

Stage 2: COLMAP Reconstruction

1. Features → 2. Match → 3. Sparse → 4. Undistort → 5. Stereo → 6. Fusion → 7. Mesh
# Feature extraction (with optional mask path)
colmap feature_extractor --database_path db.db --image_path ./images \
    --ImageReader.camera_model SIMPLE_RADIAL --ImageReader.single_camera 1 \
    --ImageReader.mask_path ./masks/  # optional

# Matching: try sequential first (video), fallback to exhaustive
colmap sequential_matcher --database_path db.db --SequentialMatching.overlap 10

# Sparse → Dense → Mesh
colmap mapper --database_path db.db --image_path ./images --output_path sparse
colmap image_undistorter --image_path ./images --input_path sparse/0 --output_path dense
colmap patch_match_stereo --workspace_path dense
colmap stereo_fusion --workspace_path dense --output_path dense/fused.ply
colmap poisson_mesher --input_path dense/fused.ply --output_path dense/mesh.ply \
    --PoissonMeshing.depth 12
[Figure: 1. input video → 2. segmented frames → 3. feature matches → 4. point cloud → 5. 3D mesh]

Optional: Provide Masks

Don't remove backgrounds—black/green bleeds into reconstruction! Provide binary masks via --ImageReader.mask_path
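
One hedged way to generate such masks, assuming the rembg package (any person-segmentation tool works); COLMAP expects masks/<image_name>.png (e.g. frame_00001.jpg.png) and ignores zero-valued pixels:

from pathlib import Path
from PIL import Image
from rembg import remove   # assumed dependency

Path('masks').mkdir(exist_ok=True)
for img_path in sorted(Path('images').glob('*.jpg')):
    matte = remove(Image.open(img_path), only_mask=True)   # 8-bit matte
    mask = matte.point(lambda p: 255 if p > 127 else 0)    # binarize
    mask.save(f'masks/{img_path.name}.png')                # image.jpg.png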

Validation Checklist

After features: 2,000-10,000 per image
After matching: 100-1,000+ matches/pair
After sparse: 70-90% images registered
After fusion: millions of points

Common Failures

<500 features: blurry or textureless
<50% registered: insufficient overlap
Sequential fails: try exhaustive_matcher
Poisson Depth → Vertex Count
depth 8 → ~2K verts (preview)
depth 10 → ~42K verts
depth 11 → ~256K verts
depth 12+ → ~1.4M+ verts ⭐
Key Outputs
sparse/0/ ← camera poses
dense/fused.ply ← point cloud
dense/mesh.ply ← final mesh
4 / 14

Stage 3: Mesh Preprocessing

COLMAP mesh (~1-2M verts, Y-down) → 1. Clean → 2. Simplify → 3. Transform → ready for fitting (~50k verts, Y-up, meters)
1. Clean
import numpy as np
import open3d as o3d

# Remove degenerate geometry
mesh.remove_degenerate_triangles()
mesh.remove_duplicated_vertices()
mesh.remove_non_manifold_edges()

# Keep largest connected component
clusters, cluster_n, _ = mesh.cluster_connected_triangles()
clusters = np.asarray(clusters)
largest = np.argmax(cluster_n)
mesh.remove_triangles_by_mask(clusters != largest)

# Poisson → watertight (returns mesh + densities)
pcd.estimate_normals()
mesh, _ = o3d.geometry.TriangleMesh\
    .create_from_point_cloud_poisson(pcd, depth=9)
2. Simplify
from scipy.spatial import cKDTree

# Save geometry + colors before decimation (colors may be lost)
old_verts = np.asarray(mesh.vertices).copy()
old_colors = np.asarray(mesh.vertex_colors).copy()

# Quadric decimation
mesh = mesh.simplify_quadric_decimation(
    target_number_of_triangles=100000)

# Transfer colors via nearest-neighbor lookup
tree = cKDTree(old_verts)
_, idx = tree.query(np.asarray(mesh.vertices))
mesh.vertex_colors = o3d.utility.Vector3dVector(old_colors[idx])

# Recompute normals
mesh.compute_vertex_normals()
3. Transform
# Scale to target height (meters)
h = verts[:,1].max() - verts[:,1].min()
verts *= (1.7 / h)

# Flip Y: COLMAP Y-down → SMPL-X Y-up
verts[:,1] *= -1

# Center: pelvis at origin (~0.9m up)
verts[:,1] -= verts[:,1].min() + 0.9
verts[:,0] -= verts[:,0].mean()
verts[:,2] -= verts[:,2].mean()
Coordinate Systems
COLMAP: Y-down ↓
SMPL-X: Y-up ↑
verts[:,1] *= -1

Why Poisson Reconstruction?

COLMAP meshes have holes and non-manifold edges. Poisson creates a watertight surface required for skinning.

Why Simplify to 50k?

SMPL-X fitting computes distances from every vertex. 1M+ vertices → GPU OOM errors.

5 / 14

Stage 4: SMPL-X Model Fitting


What is SMPL-X?

SMPL eXpressive — unified body model combining SMPL body, FLAME face, MANO hands.

Vertices: 10,475
Joints: 54
Shape (β): 10 params
Body pose: 63 params
Hands (PCA): 24 params
Expression (ψ): 10 params
[Figure: SMPL-X model]

Chamfer Distance Loss

L_chamfer = 1/|A| · Σ_{a∈A} min_{b∈B} ‖a−b‖₂ + 1/|B| · Σ_{b∈B} min_{a∈A} ‖b−a‖₂
# Pairwise distances: (N_smplx, N_scan)
dist = torch.cdist(smplx_v, scan_v)
loss_s2t = dist.min(dim=1)[0].mean()  # SMPL-X → scan
loss_t2s = dist.min(dim=0)[0].mean()  # scan → SMPL-X
chamfer = (loss_s2t + loss_t2s) / 2

SMPL-X Forward Pass

import smplx
model = smplx.create(
    model_path='models/',
    model_type='smplx',
    gender='neutral',
    num_betas=10
).to(device)

output = model(
    betas=betas,        # (1,10)
    body_pose=pose,     # (1,63)
    global_orient=orient, # (1,3)
    transl=transl       # (1,3)
)
vertices = output.vertices  # (1,10475,3)

Multi-Stage Optimization

1 Global: orient, transl, scale
2 Shape: β (body proportions)
3 Coarse pose: HIGH regularization
4 Fine pose: lower regularization
5 Joint: all params together
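
A minimal sketch of this schedule, assuming the model and Chamfer loss from above; betas/pose/orient/transl/scale are leaf tensors with requires_grad=True, and the iteration counts, learning rate, and simple L2 pose prior are illustrative placeholders, not the exact recipe:

import torch

def chamfer_loss(a, b):
    d = torch.cdist(a, b)
    return (d.min(dim=1)[0].mean() + d.min(dim=0)[0].mean()) / 2

stages = [  # (params to optimize, iterations, pose-regularization weight)
    ([orient, transl, scale],               100, 0.0),  # 1. global
    ([betas],                               200, 0.0),  # 2. shape
    ([pose],                                200, 1.0),  # 3. coarse pose
    ([pose],                                300, 0.1),  # 4. fine pose
    ([betas, pose, orient, transl, scale],  300, 0.1),  # 5. joint
]
for params, iters, pose_reg in stages:
    opt = torch.optim.Adam(params, lr=0.01)
    for _ in range(iters):
        opt.zero_grad()
        out = model(betas=betas, body_pose=pose,
                    global_orient=orient, transl=transl)
        verts = out.vertices[0] * scale          # fitted scale from stage 1
        loss = chamfer_loss(verts, scan_v) \
             + pose_reg * (pose ** 2).mean()     # keeps joints near rest
        loss.backward()
        opt.step()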

Pose Presets

Preset Shoulder Z Use When
t-pose ✓ Arms horizontal (recommended)
a-pose ±45° Arms diagonal
relaxed ±72° Natural standing
6 / 14

⚠️ Pose Initialization Matters!

Wrong initialization → optimizer must rotate joints through large angles → twisted limbs and incorrect color transfer.

Shoulder Z-rotation by preset: t-pose 0° • a-pose ±45° • relaxed ±72° • arms-at-side ±90°

[Figure: ❌ T-pose init (scan was relaxed) • ✓ relaxed init (matches scan) • ~ a-pose init (27° off) • ❌ arms above head (severe)]
💡 Best practice: Capture subject in T-pose → default init (zeros) works perfectly.
7 / 14

Stage 5: Persistent Color Mapping


❌ Naive Approach: Per-Frame k-NN

# For EVERY frame... (apply_pose/knn are stand-ins)
for pose in poses:
    posed_smplx = apply_pose(smplx, pose)
    colors = knn(posed_smplx, colmap_mesh)
    # But the COLMAP scan is frozen in its original pose!

Problem: COLMAP mesh is static. When SMPL-X pose changes, k-NN finds wrong neighbors from the frozen scan.

🔑 Key Insight: LBS Preserves Vertex Indices

Linear Blend Skinning transforms vertex positions, not identities:

v_i(θ) = Σ_j w_ij · G_j(θ) · v_i^rest
  • ✓ Vertex 5432 is always vertex 5432 (e.g., tip of nose)
  • ✓ Colors assigned to index persist through motion
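
A toy numpy sketch of LBS (shape bookkeeping only; in SMPL-X the transforms G_j come from the pose θ and the kinematic chain), showing that row i of the output is still vertex i:

import numpy as np

def lbs(rest_verts, W, G):
    # rest_verts: (V, 3), W: (V, J) blend weights, G: (J, 4, 4) transforms
    V = rest_verts.shape[0]
    rest_h = np.concatenate([rest_verts, np.ones((V, 1))], axis=1)  # (V, 4)
    T = np.einsum('vj,jab->vab', W, G)            # blended per-vertex 4×4
    posed = np.einsum('vab,vb->va', T, rest_h)    # positions move...
    return posed[:, :3]                           # ...row i stays vertex i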

✅ The Solution: Compute Once, Reuse Forever

# Compute ONCE in the fitted pose (knn_interpolate = the
# distance-weighted k-NN routine shown below)
colors = knn_interpolate(fitted_smplx, colmap_mesh)
np.savez('vertex_colors.npz', colors=colors)

# For ALL frames - just load!
colors = np.load('vertex_colors.npz')['colors']
mesh.visual.vertex_colors = colors
# Colors "travel" with vertices via LBS

k-NN with Distance Weighting

import numpy as np
from scipy.spatial import cKDTree

# Build tree from COLMAP vertices
tree = cKDTree(colmap_verts)

# Query k=8 nearest for each SMPL-X vertex
dists, idxs = tree.query(smplx_verts, k=8)

# Distance-weighted interpolation
weights = 1.0 / (dists + 1e-8)
weights /= weights.sum(axis=1, keepdims=True)

# Weighted average of neighbor colors
colors = np.einsum('nk,nkc->nc',
    weights, colmap_colors[idxs])

Why This Works

The key realization: Color is a property of the vertex, not the position.

Once you assign "red" to vertex 5432 (the nose), it stays red whether the nose is at (0,0,1) or (0.5, 0.2, 1.1).

Why Per-Frame Fails

Per-frame: Arm raises → arm vertices near COLMAP torso → k-NN returns torso colors
Persistent: Colors computed when aligned → locked to vertex index → correct colors travel with arm
8 / 14

Stage 6: Motion Application

Core Insight: Shape β stays constant (your body). Pose θ varies per frame (the motion).

🔄 Motion Transfer Pipeline

1. Load fitted β + scale
2. Load persistent colors
3. Load motion sequence
4. Center translation ⚠️
5. Generate per-frame mesh
6. Apply colors & export
⚠️ Translation: Raw MoCap = absolute coords. Center to avoid teleporting on loop! transl -= transl[0:1]
import pickle

import torch
import trimesh
import smplx

# 1. Load YOUR fitted parameters
params = torch.load('fitted/smplx_parameters.pt')
betas = params['betas']      # Shape: your proportions
scale = params.get('scale', torch.tensor([1.0]))

# 2. Load persistent vertex colors
fitted = trimesh.load('fitted/fitted_smplx_colored.ply')
colors = fitted.visual.vertex_colors[:, :3]

# 3. Load motion sequence
data = pickle.load(open('motion.pkl', 'rb'))
# (arrays assumed to be torch tensors; wrap with torch.as_tensor if numpy)

# 4. CENTER TRANSLATION (critical for loops!)
transl = data['transl'] - data['transl'][0:1]

# 5. Initialize SMPL-X model
body_model = smplx.create(
    model_path='models/smplx',
    model_type='smplx',
    gender='neutral'
)

# 6. Generate each frame
num_frames = transl.shape[0]
for i in range(num_frames):
    output = body_model(
        betas=betas,                    # YOUR shape
        global_orient=data['global_orient'][i:i+1],
        body_pose=data['body_pose'][i:i+1],
        transl=transl[i:i+1]
    )
    
    verts = output.vertices[0].numpy() * scale.item()
    mesh = trimesh.Trimesh(verts, body_model.faces)
    mesh.visual.vertex_colors = colors
    mesh.export(f'frames/frame_{i:05d}.ply')
9 / 14

Stage 7: VAT Conversion (Optional)


🎬 What is VAT?

Vertex Animation Textures encode mesh animation directly into image textures, shifting computation from CPU to GPU.

Each pixel = 1 vertex at 1 frame
R channel = X position
G channel = Y position
B channel = Z position
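
Under that layout, the pixel address of a given vertex at a given frame is pure arithmetic; a hypothetical helper (row-major packing, 3 texture rows per frame for 10,475 vertices):

def vat_pixel(frame, vertex, tex_width=4096, rows_per_frame=3):
    # Each frame occupies rows_per_frame consecutive texture rows
    row = frame * rows_per_frame + vertex // tex_width
    col = vertex % tex_width
    return col, row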

📈 PLY vs VAT

Metric   PLY          VAT
Load     30s–3min     <1s
Size     0.7–2.5GB    15–50MB
FPS      ~15          60+

75× faster load • 50× smaller • GPU-native playback

🔢 16-bit Precision

8-bit PNG (0-255) lacks precision. Split 16-bit into high/low textures:

# Python encode (pos_min/pos_max: per-axis bounds, stored in metadata)
norm = (pos - pos_min) / (pos_max - pos_min)
enc16 = np.round(norm * 65535).astype(np.uint16)
high_8 = (enc16 >> 8).astype(np.uint8)
low_8 = (enc16 & 0xFF).astype(np.uint8)
// GLSL vertex shader decode
vec3 hi = texture2D(highTex, uv).rgb * 255.0;
vec3 lo = texture2D(lowTex, uv).rgb * 255.0;
vec3 n = (hi * 256.0 + lo) / 65535.0;
return mix(minBounds, maxBounds, n);

📁 Output Files

motion_name/
├── position_high.png
├── position_low.png
├── color_texture.png
└── metadata.json

⚡ Automatic Chunking

Mobile GPUs limit textures to 4096×4096. With 10,475 vertices → max 1,365 frames per texture. Long motions auto-split into seamless chunks.
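
The arithmetic behind that 1,365-frame cap, as a quick sanity check (one vertex per pixel, frames stacked in rows):

TEX_SIZE = 4096
NUM_VERTS = 10475
rows_per_frame = -(-NUM_VERTS // TEX_SIZE)        # ceil(10475/4096) = 3
frames_per_texture = TEX_SIZE // rows_per_frame   # 4096 // 3 = 1365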

VAT Playback Demo
# Convert with auto-chunking
python convert_vat_chunked.py \
    moyo_frames/ \
    -o vat_universal/ \
    -j 8  # 8 parallel workers
10 / 14

🎉 Final Result: Animated SMPL-X Avatar

📷→🧍 Photos to Avatar: multi-view reconstruction
🎨 Color Transfer: COLMAP → SMPL-X vertices
💃 Motion Ready: any MoCap dataset
🌐 Web Playback: 60fps via VAT

✅ What This Pipeline Produces

  • SMPL-X parametric mesh (10,475 vertices)
  • Vertex colors from COLMAP reconstruction
  • Fully rigged and animatable body model
  • Compatible with any SMPL-X motion data

🔮 Next Step: True Photorealism

  • Transfer skinning weights to COLMAP mesh
  • Preserve full geometric detail (millions of verts)
  • UV-mapped texture instead of vertex colors
11 / 14

🚀 Advancement: Hybrid Mesh Animation

Transfer SMPL-X skinning weights to original COLMAP meshes — millions of vertices with full geometric detail, now animatable!

SMPL-X Only: 10,475 vertices • vertex colors • fast parametric rendering (83× faster)
🌟 Hybrid Model: COLMAP mesh + SMPL-X skeleton • full high-res detail, animatable

[Figure: side-by-side validation, SMPL-X (blue) vs hybrid (original colors)]

Key Innovation: Weight Transfer

  • Barycentric interpolation of LBS weights from SMPL-X to COLMAP vertices
  • Local coordinate frame transformation preserves geometric detail
  • Distance thresholds handle background geometry
  • GPU acceleration via PyTorch3D for 1M+ vertices

Technical Approach

# For each COLMAP vertex:
# 1. Find nearest SMPL-X triangle
# 2. Compute barycentric coordinates
# 3. Interpolate blend weights
weights_colmap = bary_interp(smplx_weights,   # (10475, J) LBS weights
                             smplx_faces,
                             colmap_verts)    # sketched below
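
One possible bary_interp, sketched with trimesh on CPU (a PyTorch3D version would swap closest_point for its GPU point-to-mesh queries); the extra smplx_verts argument and the variable names are illustrative:

import numpy as np
import trimesh

def bary_interp(smplx_weights, smplx_faces, colmap_verts, smplx_verts):
    mesh = trimesh.Trimesh(smplx_verts, smplx_faces, process=False)
    # Nearest point on the SMPL-X surface for every COLMAP vertex
    closest, dist, tri_id = trimesh.proximity.closest_point(mesh, colmap_verts)
    # (dist can gate background geometry via a distance threshold)
    # Barycentric coordinates of each closest point inside its triangle
    tris = mesh.triangles[tri_id]                                   # (N, 3, 3)
    bary = trimesh.triangles.points_to_barycentric(tris, closest)   # (N, 3)
    # Blend the LBS weight vectors of the three triangle corners
    corner_w = smplx_weights[smplx_faces[tri_id]]                   # (N, 3, J)
    return np.einsum('nc,ncj->nj', bary, corner_w)                  # (N, J)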
12 / 14

Hybrid Models: Additional Examples

[Figure: Subject 2, SMPL-X only (10,475 vertices, fast but low detail) vs hybrid model (full COLMAP detail preserved); note the clothing detail preservation]

Result: true photorealistic avatars with millions of vertices, fully animatable with any SMPL-X motion sequence.
13 / 14

Pipeline Summary & Next Steps

📋 Complete Pipeline

1. 📷 Capture — 50-200 photos, T-pose
2. 🏗️ COLMAP — SfM → Dense → Mesh
3. 🧹 Preprocess — Clean, simplify, scale
4. 🎯 SMPL-X Fit — Chamfer optimization
5. 🎨 Color Map — Persistent k-NN
6. 💃 Motion — MOYO/AMASS retargeting
7. 🌐 VAT — Web-ready (optional)
✅ Output: Animated Avatar
10,475 vertices • vertex colors • 60fps

⚠️ Common Issues & Solutions

❌ SMPL-X arms twisted → ✅ Match preset to scan pose
❌ Wrong colors on posed mesh → ✅ Use persistent color map from the fitted pose
❌ Avatar teleports in animation → ✅ Subtract first-frame translation
❌ Mesh upside down / rotated → ✅ Apply the Y-flip for COLMAP→SMPL-X
❌ COLMAP reconstruction sparse → ✅ More photos, 70%+ overlap

💡 Key Insights

Pose initialization matters: match the preset to the actual scan pose
Persistent color mapping: establish once, reuse for all poses
Coordinate systems: COLMAP Y-down ↔ SMPL-X Y-up
Translation centering: relative motion for seamless loops

🔮 Future Directions

  • Real-time LBS (500-800× storage reduction)
  • Neural rendering (NeRF, Gaussian Splatting)
  • Multi-camera professional capture
  • Physics-based cloth simulation

🙋

Questions?

14 / 14