.. _lecture_fitting_smpl_to_imu_learning:

Lecture 07.2: Fitting SMPL to IMU Data Using Learning-Based Methods
============================================================================

.. raw:: html

   <iframe width="600" height="400" src="https://www.youtube.com/embed/HLbuRR3RIdk" 
   title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; 
   encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

`Lecture Slides: Fitting SMPL to IMU Data Using Learning-Based Methods <https://virtualhumans.mpi-inf.mpg.de/VH23/slides/pdf/Lecture_07_2_Fitting_SMPL_to_IMU_Learning.pdf>`_

Introduction
---------------

Estimating human body pose from inertial sensor data lies at the intersection of graphics, vision, and machine learning. Wearable IMUs (Inertial Measurement Units) provide orientation and acceleration readings for the body parts they're attached to, but unlike camera-based motion capture, they don't directly provide global position information. Our task is to fit a parametric human model (SMPL) to a sparse set of IMU signals – typically only 6 sensors on the body – and recover the full 3D pose.

This problem is severely under-constrained: many different body poses can produce the same set of IMU readings. Traditional solutions used optimization to enforce physical consistency, but these tend to be slow and operate offline. Recent learning-based approaches leverage data-driven priors to solve IMU-to-pose estimation in real time.

This chapter provides a comprehensive overview of learning-based methods for IMU-driven pose reconstruction, focusing on techniques that output SMPL pose parameters from sparse IMUs.

Optimization-Based vs. Learning-Based Approaches
-------------------------------------------------

Early approaches to IMU-based pose estimation treated it as an optimization problem: given IMU sensor measurements, find the body pose parameters that best reproduce those measurements using a human model.

**Optimization-Based Approaches**

A seminal example is Sparse Inertial Poser (SIP) by von Marcard et al. (2017). SIP attaches 6 IMUs (on the wrists, lower legs, head, and back) to a subject and fits the SMPL model's pose parameters :math:`\boldsymbol{\theta}` (and shape :math:`\boldsymbol{\beta}`) such that the model's virtual sensors have orientations and accelerations matching the measured IMUs over time.

SIP uses a realistic statistical body model (SMPL) to impose anthropometric constraints and performs a joint optimization over multiple frames. This yields physically plausible pose sequences from only sparse sensors, even for challenging motions. However, SIP is an offline method – it requires minutes of computation for a motion sequence – and cannot run in real time. Moreover, solving the non-linear least-squares fitting can be sensitive to initialization and might still drift without strong priors.

**Learning-Based Approaches**

Learning-based approaches replace the iterative optimization with a trained predictive model (typically a deep neural network). The idea is to have a model learn the mapping from IMU sensor signals to human pose, based on examples. This was first demonstrated by Deep Inertial Poser (DIP) in 2018.

Learning-based methods bring several advantages:
- Once trained, they are extremely fast at runtime (achieving real-time inference, e.g., 60–90 FPS)
- They can implicitly learn human motion priors from data rather than relying on hand-crafted regularizers
- They naturally handle temporal dependencies by training on motion sequences

On the downside, learning methods require large amounts of training data (with ground-truth poses), which can be hard to obtain from IMU setups. They may also struggle with generalization to motions or subjects not well-represented in training data and can produce implausible results if asked to extrapolate beyond their learned priors.

**Hybrid Approaches**

A hybrid approach attempts to combine the strengths of both. For example, the Physics Informed/Physical Inertial Poser (PIP) method (2022) uses a neural network to produce an initial pose estimate quickly (the "kinematics" stage) and then runs a lightweight physics-based optimization to refine the pose so that it obeys physical constraints (ground contact, dynamics consistency).

This two-stage strategy still runs in real time (PIP achieves 60 FPS with 16 ms latency) and improves accuracy and physical realism over pure learning methods. Hybrid approaches highlight an important trend: purely data-driven models can be augmented with domain knowledge (physics, kinematics) to curb their failure modes (like foot skating or gradual drift).

In summary:
- Optimization-based methods (like SIP) excel in accuracy and physical consistency but are offline and require careful tuning
- Learning-based methods (like DIP and successors) enable real-time motion capture from IMUs at the cost of needing extensive training data
- Hybrid methods strive for the best of both: using learning for speed and optimization/physics for accuracy and plausibility

Learning-Based IMU-to-Pose Estimation: Historical Overview of Key Models
------------------------------------------------------------------------

Deep Inertial Poser (DIP, 2018)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Deep Inertial Poser (DIP) was the first deep learning method to estimate full-body 3D pose from a sparse set of wearable IMUs. Published at SIGGRAPH Asia 2018 (Huang et al.), DIP demonstrated that a neural network can replace the expensive optimization used in SIP and still recover accurate poses in real time. DIP uses only 6 IMUs (on the lower legs, wrists, head, and pelvis/back) and outputs the 3D pose in the format of SMPL's joint rotation parameters.

**Data and Training:**

One challenge DIP faced was the lack of large paired IMU–pose datasets. To train the network, DIP generated synthetic IMU data from existing motion capture datasets. The authors leveraged the newly introduced AMASS dataset (a large collection of mocap sequences retargeted to SMPL format) to obtain diverse human motions.

Using the SMPL model's kinematics, they simulated IMU readings (sensor orientations and accelerations) for each motion by placing virtual sensors on the SMPL body and computing their orientation and linear acceleration over time. This provided effectively unlimited training data in a unified format.

DIP also recorded a smaller real dataset called DIP-IMU for validation and fine-tuning. The DIP-IMU dataset consists of 10 subjects wearing 17 Xsens IMUs, with ground-truth SMPL poses obtained via a full-suit optimization; it contains 64 sequences (about 330k frames) and is one of the largest IMU-motion datasets made public.

**Network Architecture:**

DIP employs a recurrent neural network (RNN) to model the temporal sequence of poses. In particular, DIP uses a stacked bi-directional RNN with gated recurrent units (GRUs) or LSTM units that process the IMU signals over time.

At each time step, the input to the network is the collection of sensor measurements from all 6 IMUs (each providing an orientation, represented e.g. as a quaternion or rotation matrix, and a linear acceleration vector). DIP normalizes these measurements carefully – for example, accounting for different coordinate frames (sensor local frame vs. global frame vs. SMPL body frame) so that the network learns in a consistent space.

The bi-directional RNN means that during training the network has access to both past and future IMU readings when estimating the pose at a given frame, which helps accuracy by smoothing over ambiguous frames. At test time, future data isn't available, so DIP is deployed in a sliding window or forward-only manner to maintain real-time performance.

**Output and Loss:**

The network outputs the pose in terms of SMPL joint rotations for each time step. DIP represented rotations in axis-angle form (3 parameters per joint, 72-D output for 24 joints) or a similar rotation representation.

To account for the high uncertainty in this problem – certain IMU configurations don't uniquely determine a pose – DIP's loss function was formulated as a negative log-likelihood of a Gaussian output distribution. In practice, the network predicts both a mean pose and a diagonal covariance (per output dimension) at each time, effectively learning a per-joint uncertainty.

The training then maximizes the likelihood of the ground-truth under that Gaussian (equivalent to a weighted L2 loss where the weights are learned variances). This heteroscedastic regression allows the model to express less confidence (higher predicted variance) in ambiguous degrees of freedom.

**Performance:**

DIP demonstrated, for the first time, real-time 3D pose reconstruction from 6 IMUs. It achieves over 120 fps on a GPU, far exceeding real-time needs, and qualitatively the reconstructed motions are smooth and closely match the ground-truth motion capture.

On standard benchmarks (like TotalCapture and DIP-IMU test data), DIP showed accuracy improvements over the optimization baselines and prior methods, especially in dynamic motions, while running orders of magnitude faster.

TransPose (2021)
^^^^^^^^^^^^^^^^^^

After DIP, one limitation remained: DIP (and similar models) estimated joint rotations but did not explicitly compute global translations of the body. In other words, a subject walking forward would be reconstructed by DIP as walking in place (feet moving, but root position fixed) because IMUs alone do not provide absolute position.

TransPose (Xinyu Yi et al., SIGGRAPH 2021) tackled this by producing both the body pose and the global trajectory from IMUs. TransPose is a DNN-based full motion capture system using 6 IMUs, achieving over 90 fps in real time.

**Pose Estimation Innovations:**

For the body pose, TransPose introduced a multi-stage neural network that estimates joint positions in a hierarchical manner (from extremities inward) before resolving the final joint rotations.

Specifically, the network first predicts the 3D positions of "leaf" joints (e.g., hands, feet, head) relative to the root, then uses these as intermediate constraints to predict the positions of more central joints, and so on, ultimately inferring the full skeletal pose.

By breaking the problem down into leaf-to-root predictions, the network can more easily satisfy kinematic constraints (like where the feet and hands should be) before determining the internal posture. This design improved both accuracy and computational efficiency of pose estimation.

**Global Translation Estimation:**

A key contribution of TransPose is a solution for global body translation (the position of the root in the world). With no direct positional sensors, TransPose uses two strategies:
1. A supporting-foot heuristic
2. An RNN-based predictor

These are fused by a confidence measure. The supporting-foot method leverages the fact that when one foot is on the ground and stationary, the body's horizontal displacement can be inferred by assuming that foot is static on the floor.

The second method is an RNN that directly learns to predict the velocity or displacement of the body from the IMU sequence. TransPose combines these two estimates with a confidence-based fusion: when the supporting-foot detector is confident (e.g., foot contact is certain), that solution is trusted more; otherwise the learned translation is used.

This yields robust global position tracking – for example, a person can walk around a large area and TransPose's virtual avatar will accurately follow the trajectory, something not possible with DIP alone.

**Performance and Impact:**

TransPose reported substantially higher accuracy than prior state-of-the-art (including DIP and SIP) on benchmark datasets. By capturing global motion, it enabled full 3D motion capture from IMUs alone, previously achievable only with additional external references like cameras.

TransPose remained efficient, running at 90+ fps. It was also evaluated on live scenarios (the authors demonstrated ping-pong playing, umbrella walking, etc. with only IMUs) and showed reliable results.

Transformer Inertial Poser (TIP, 2022)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While RNNs were effective for DIP and TransPose, the field soon explored Transformers for modeling temporal sequences of IMU data. Transformer Inertial Poser (TIP) (Jiang et al., SIGGRAPH Asia 2022) is an attention-based model that estimates full-body motion from 6 IMUs and even generates the 3D terrain profile that the person walked on.

TIP addresses some key challenges: maintaining long-term temporal consistency, minimizing drift in global/joint motion, and handling a variety of motions on different terrains.

**Transformer Architecture:**

TIP uses a conditional Transformer decoder architecture. Rather than processing the IMU sequence purely with recurrence, TIP's Transformer can attend over a history window of inputs and past predictions.

It explicitly feeds back the previous pose outputs as input context for the next prediction, thereby "reasoning about the prediction history". This design helps produce consistent predictions and avoid jitter, as the model can learn to not deviate drastically from recent poses and to enforce smooth kinematics.

The self-attention mechanism in the Transformer is well-suited to capturing both short- and long-range dependencies (e.g., patterns in gait or periodic motions) that RNNs might struggle with if not well-trained.

**Stationary Body Points (SBP):**

One of TIP's novel concepts is the introduction of Stationary Body Points (SBPs) as a learning target. SBPs are points on the body that are momentarily stationary with respect to the world (for example, a foot during stance phase, or perhaps the hips when sitting on a chair).

TIP's network is trained to predict which body points are stationary at each time and to output their global positions. These SBPs can be identified robustly by the Transformer. TIP then uses analytical routines to enforce consistency at those points.

For instance, if the network predicts that the left foot is an SBP at frame t (meaning the left foot should be on the ground and not moving), TIP will correct the raw pose prediction by adjusting the left foot's pose to exactly maintain its previous global position (preventing any slight drift or jitter in that foot).

SBP predictions thus act as self-discovered "contact constraints" that the system applies to improve physical realism. This is a form of post-processing using network outputs: the network doesn't directly output a final pose, but rather intermediate signals (SBPs) that allow a deterministic fixing of drift.

**Terrain Generation:**

Because TIP can identify when and where feet are stationary, it can also infer properties of the ground. TIP includes an algorithm to generate a terrain height map from the pattern of SBPs.

Essentially, as the person walks, TIP accumulates the lowest positions of stationary feet and assumes those lie on the ground surface, constructing a coarse height field of the walked terrain. This is useful for visualization (showing the ground the person likely walked on) and more importantly, TIP feeds the terrain information back to correct global motion.

If the terrain is uneven, the global root motion can be adjusted to respect the height map (ensuring the feet meet the ground height). By integrating environment context, TIP goes beyond predicting pose in isolation, making the result more plausible.

**Performance:**

TIP was evaluated on both synthetic data and real IMU recordings, demonstrating improved accuracy over strong baselines like DIP and TransPose. It showed excellent temporal stability, with much reduced drift in joint positions thanks to SBP corrections.

In live demos, TIP could reconstruct motions on various imagined terrains in real time.

Physics/Physical Inertial Poser (PIP, 2022)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While the above learning-based methods implicitly learn kinematic priors, they do not guarantee physical correctness (e.g., a jump might not obey gravity, or predicted joint torques could be implausible).

Physical Inertial Poser (PIP) (Xinyu Yi et al., CVPR 2022) takes a hybrid approach by integrating a physics simulation layer with the neural network. PIP is physics-aware real-time motion capture from 6 IMUs, and notably it estimates not only pose but also joint torques and ground reaction forces – a full dynamics solution.

**Two-Module Approach:**

PIP consists of a neural kinematics module followed by a physics-based optimizer. The first stage is similar to TransPose/DIP: a neural network (in PIP's case, a relatively lightweight one to keep latency low) regresses an initial pose sequence from the IMU data.

This kinematic output on its own might have minor foot skating or may drift over long sequences (since the network isn't perfect). The second stage then takes that initial motion as a reference and runs a short physics simulation/optimization to adjust the motion so that it satisfies Newtonian physics and environmental contacts.

This involves a dynamic model of the human (often using an articulated rigid body model matching SMPL's skeleton, with masses, inertia, etc.) and possibly a ground plane. The optimizer in PIP corrects any physical violations – e.g., if a foot was predicted slightly below the ground or moving when it should be static, the physics module will adjust forces/torques to keep it planted.

If the initial motion implied non-zero acceleration of center-of-mass without sufficient ground reaction force, the physics correction will tweak the motion (and output forces) to obey momentum conservation. In essence, the second stage finds the closest physically valid motion to the network's output.

Because the network's output is already close to correct, the physics optimization can converge very fast (PIP manages 60 Hz total). The result is a pose sequence that not only fits the IMU data but also could be produced by a plausible set of forces and torques – i.e., it's physically simulatable.

PIP outputs the estimated joint torques and ground reaction forces as well, which are by-products of making the motion physical. This is valuable for biomechanics or VR applications requiring feedback forces.

**Performance:**

PIP achieved state-of-the-art accuracy on standard evaluations, with notable improvements in temporal stability and physical realism. For example, PIP's output for very long motions does not drift over time, unlike some purely learned models that might accumulate small errors.

Qualitatively, motions like running and jumping are handled more faithfully – the jumps have proper apex and landing dynamics, and running has realistic foot pushes. The authors reported that previous methods struggled with such long or dynamic motions, producing artifacts due to lack of physics.

PIP was even recognized as a Best Paper Finalist at CVPR 2022, indicating the significance of combining learning with physics.

Other Notable Models and Developments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the above primary methods, there have been other significant contributions in the IMU-to-pose literature:

**Fusion Methods:** Some works combine IMUs with other sensors (e.g., cameras). For instance, the TotalCapture project (Trumble et al. 2017) itself presented a method fusing multi-view video with IMUs to improve pose accuracy. More recent "fusion poser" approaches use deep networks to blend sparse IMUs with a smartphone camera feed or ambient sensors to enhance global positioning.

**High-Frequency and Low-Latency Models:** One work proposed a Fast Inertial Poser (2024) that simplifies the network to run on mobile devices in real time, fusing raw IMU streams with efficient temporal convolution (targeting >100 Hz on a phone).

**Handling Loose or Varied Sensor Placement:** A challenge is that IMUs might not always be tightly attached at known body landmarks. The Loose Inertial Poser (LIP) introduced methods to handle the misalignment and slippage of sensors on a loose garment (a jacket). By learning a calibration and compensation for sensor movement relative to the body, it achieves accurate pose with a more user-friendly setup.

**Commercial Systems and Extended Datasets:** Companies like Xsens (Roetenberg et al., 2007) have full-body IMU suits (with 17 sensors) that use proprietary sensor fusion algorithms for pose. These are not purely learning-based (they use extended Kalman filters and models), but provide ground truth data and set benchmarks. Datasets like TotalCapture and DIP-IMU have been used to evaluate new algorithms.

Problem Formulation and Learning Task Definition
--------------------------------------------------

Input and Output Representations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We define the human body model and sensor measurements formally. The SMPL model parameterizes a human body pose by a vector :math:`\boldsymbol{\theta} \in \mathbb{R}^{3K}`, where typically :math:`K=24` joints and each joint's rotation is given in 3-parameter axis-angle form. We can denote the pose as :math:`\boldsymbol{\theta} = [\boldsymbol{\theta}_1, \boldsymbol{\theta}_2,\dots,\boldsymbol{\theta}_K]` where each :math:`\boldsymbol{\theta}_j \in \mathbb{R}^3` is the axis-angle rotation of joint :math:`j` in the kinematic chain (including the root orientation).

The body shape is given by parameters :math:`\boldsymbol{\beta} \in \mathbb{R}^{N}` (often :math:`N=10` principal shape coefficients) which can be fixed or predicted, though most learning-based methods assume an average shape or calibrate :math:`\boldsymbol{\beta}` beforehand. The full 3D mesh or joint coordinates :math:`\mathbf{J}(\boldsymbol{\theta}, \boldsymbol{\beta})` can be obtained via SMPL's forward kinematics function.

An IMU sensor attached to a body segment provides two main readings:
1. The orientation of the sensor (often as a quaternion :math:`q \in \mathbb{H}` or rotation matrix :math:`R \in SO(3)`) with respect to the world frame
2. The linear acceleration :math:`\mathbf{a} \in \mathbb{R}^3` of the sensor in the world frame (often including gravity)

Formally, let there be :math:`M` IMUs on the body, each rigidly attached at a known location (e.g., left wrist, right wrist, etc.). At time :math:`t`, IMU :math:`i` provides an orientation :math:`R_{i}^{(t)}` (which transforms a vector from the sensor's local frame to the global inertial frame) and an acceleration vector :math:`\mathbf{a}_{i}^{(t)}` (usually measured in the sensor's local coordinates or converted to global coordinates).

The set of all IMU readings at time :math:`t` can be denoted as:

.. math::

   x(t) = \{R_1^{(t)}, a_1^{(t)}, R_2^{(t)}, a_2^{(t)}, \ldots, R_M^{(t)}, a_M^{(t)}\}

which is the input to our pose estimation system at time :math:`t`.

In practice, one might represent each orientation :math:`R_i` as a 4D quaternion or a 6D/9D rotation representation to feed into a network. Each acceleration :math:`\mathbf{a}_i` is a 3D vector. So the input :math:`\mathbf{x}^{(t)}` is a fixed-size vector concatenating all IMU data (e.g., for :math:`M=6` IMUs, if using quaternions, :math:`\mathbf{x}^{(t)} \in \mathbb{R}^{6*7}` since each IMU gives 4 (orientation) + 3 (accel) values).

The output of the learning-based model at time :math:`t` is an estimate of the body's pose (and possibly global position). This can be represented in various ways: the most direct is the SMPL pose parameter vector :math:`\hat{\boldsymbol{\theta}}^{(t)} \in \mathbb{R}^{72}`.

Some methods output the global translation :math:`\hat{\mathbf{p}}_{\text{root}}^{(t)} \in \mathbb{R}^3` for the root (pelvis) as well, if tackling global motion (TransPose, TIP do this). Others might output joint positions :math:`\hat{\mathbf{J}}^{(t)} \in \mathbb{R}^{3K}` directly in 3D space and then fit rotations to those (TransPose's intermediate stage).

For clarity, we consider the output to ultimately be the full pose and global position: :math:`\hat{\mathbf{y}}^{(t)} = [\hat{\boldsymbol{\theta}}^{(t)}; \hat{\mathbf{p}}_{\text{root}}^{(t)}]`. Some systems do not predict :math:`\mathbf{p}_{root}` (DIP left it undefined, effectively setting it to zero or initial position), but newer ones do.

Crucially, IMU-to-pose is inherently a sequence problem. A single time step's sensor readings :math:`\mathbf{x}^{(t)}` is often not sufficient to determine :math:`\mathbf{y}^{(t)}`; the context of previous and future frames helps disambiguate motion and gravity direction, etc. Therefore, the learning task is often defined over a time window or whole sequence.

We denote the input sequence :math:`\mathbf{X}_{1:T} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(T)})` and the corresponding pose sequence :math:`\mathbf{Y}_{1:T} = (\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(T)})`. The learning-based model can be seen as learning a function :math:`f_\Theta` (with parameters :math:`\Theta`) such that:

.. math::

   \hat{\mathbf{Y}}_{1:T} = f_\Theta(\mathbf{X}_{1:T})

Often this factorizes in time (e.g., producing one frame at a time with an RNN), but the key is it uses the sequence as input to produce a sequence of outputs.

Learning Objective and Loss Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

During training, we have ground-truth pose sequences (obtained from a high-quality system or simulation) for the training sequences of IMU data. Let :math:`\mathbf{Y}_{1:T}^{*}` be the ground-truth poses. We define a loss function :math:`L(\hat{\mathbf{Y}}, \mathbf{Y}^*)` that measures the error of our predicted sequence. Common loss terms include:

**Pose (rotation) loss:** This term directly penalizes errors in joint rotations. For example, one can convert each predicted joint rotation and ground-truth rotation to axis-angle and compute the angle difference. If :math:`\mathbf{R}_j^{(t)}(\hat{\boldsymbol{\theta}})` is the rotation matrix for joint :math:`j` from predicted pose and :math:`\mathbf{R}_j^{(t)}` is ground truth, an orientation loss can be:

.. math::

   L_{\text{orient}} = \frac{1}{KT}\sum_{t=1}^T \sum_{j=1}^K \angle(\mathbf{R}_j^{(t)} \mathbf{R}_j^{(t)T})

where :math:`\angle(R)` denotes the angle of the rotation :math:`R`. This essentially sums the per-joint angular error in degrees or radians.

Alternatively, one can use quaternion distance or even easier, an L2 loss on axis-angle components (though care is needed around :math:`2\pi` wrap-around). Some methods simplify and just do an L2 loss on the pose parameter vector:

.. math::

   L_{\text{pose}} = \frac{1}{T}\sum_t |\hat{\boldsymbol{\theta}}^{(t)} - \boldsymbol{\theta}^{*(t)}|^2

**Joint position loss:** Because ultimately we care about where body parts are, a popular loss is on 3D joint coordinates. If the model :math:`M(\boldsymbol{\theta}, \boldsymbol{\beta})` produces joint positions, we can penalize:

.. math::

   |M(\hat{\boldsymbol{\theta}}^{(t)}, \hat{\boldsymbol{\beta}}) - M(\boldsymbol{\theta}^{*(t)}, \boldsymbol{\beta}^*)|^2

summed over joints. This helps when small orientation errors of a proximal joint lead to big position errors at the extremities (the loss will directly account for that). DIP and others often included a position loss in addition to orientation loss to better guide the network.

**Global translation/position loss:** If the method predicts root translation, one can include an L2 loss on the root's position:

.. math::

   L_{\text{trans}} = \frac{1}{T}\sum_t |\hat{\mathbf{p}}_{\text{root}}^{(t)} - \mathbf{p}_{\text{root}}^{*(t)}|^2

This was relevant for TransPose (they likely trained on some data with known trajectories, possibly synthetic or from optical motion capture like TotalCapture which provides global positions).

**Velocity and acceleration loss (smoothness):** To encourage temporal consistency, loss terms can penalize differences between successive frames. For example, a velocity loss:

.. math::

   L_{\text{vel}} = \frac{1}{T-1}\sum_t |\hat{\mathbf{J}}^{(t+1)} - \hat{\mathbf{J}}^{(t)} - (\mathbf{J}^{*(t+1)} - \mathbf{J}^{*(t)})|^2

which ensures the predicted motion increments match the ground truth increments.

If ground truth velocities are not known, one might add a smoothness regularizer:

.. math::

   \sum_t |\hat{\mathbf{J}}^{(t+1)} - 2\hat{\mathbf{J}}^{(t)} + \hat{\mathbf{J}}^{(t-1)}|^2

to minimize excessive jitter (this is a second derivative (acceleration) penalty).

**Contact or consistency constraints:** Recent learning approaches incorporate specific constraints via loss. For instance, one can detect foot contact in ground truth (or define it via velocity threshold) and then add a foot velocity loss for those frames: if foot should be planted, penalize any predicted foot movement.

TIP's notion of Stationary Body Points (SBP) was actually integrated as a target; one could train a network to output a binary contact flag and use a loss against known contacts, or encourage predicted contacts to have zero velocity by a soft constraint.

Another consistency constraint is sensor reconstruction loss: since we know the input IMU orientation, one could require that the predicted pose, when fed into the forward model, reproduces the sensor readings.

Typically, the overall loss is a weighted sum of such terms:

.. math::

   L = w_{pose}L_{pose} + w_{pos}L_{pos} + w_{vel}L_{vel} + w_{cons}L_{consistency} + \ldots

The weights are tuned to balance units and importance (for instance, degrees vs meters, etc.).

DIP's probabilistic approach can be seen as having the network output an uncertainty per joint angle :math:`\sigma_j^2`, and then the loss is:

.. math::

   L = \frac{1}{T}\sum_{t,j} \frac{|\hat{\theta}_j^{(t)} - \theta_j^{*(t)}|^2}{\sigma_j^2(t)} + \log \sigma_j^2(t)

which implicitly adjusts weights :math:`w` during training. This is an advanced strategy to handle multi-modal outputs.

Temporal Modeling Approaches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The mapping from a sequence of IMU readings to a sequence of poses is highly temporal in nature. Three main approaches have been used to model time in learning-based solutions:

**Recurrent Neural Networks (RNNs):** This includes LSTM and GRU based models. An RNN processes one frame at a time, carrying forward a hidden state :math:`\mathbf{h}^{(t)}` that encodes past information.

A typical setup for IMU pose: at each time :math:`t`, feed :math:`\mathbf{x}^{(t)}` (IMU features) and the previous hidden state :math:`\mathbf{h}^{(t-1)}` into an LSTM, output a new hidden state and the pose estimate :math:`\hat{\mathbf{y}}^{(t)}`. RNNs naturally handle arbitrary sequence lengths and can be run online.

DIP used a bi-directional RNN during training, which means it had one RNN reading forward (1→T) and one backward (T→1), and combined their hidden states for the output. This gives the network future context (improving accuracy) at the cost of only working offline.

DIP's solution was to train with bi-RNN (for maximum learning of temporal patterns), but deploy with a forward-only RNN for real-time. TransPose used a multi-layer RNN in its translation module (to fuse foot heuristic with learned motion). RNNs are good with continuous streaming data and have relatively low computational cost per frame.

**Sliding windows / temporal convolution:** Another approach is to consider a fixed-size window of :math:`W` frames around time :math:`t` and use a feed-forward network (or 1-D convolution) to map that window to the pose at the center.

For example, one could take frames :math:`t-K` to :math:`t+K` as input (concatenated or as a sequence into a small CNN) and output :math:`\hat{\mathbf{y}}^{(t)}`. This was not explicitly used in DIP (they opted for RNN), but some works (e.g., a baseline DIP tried and found less effective) might have attempted a fully connected network on a window. 

Temporal convolutional networks (TCN) could also be applied: convolving on the time axis to produce a smoothed, latency-controlled output. The advantage of window methods is parallel processing (process all frames in a sequence in parallel if using convolution), and the ability to incorporate some future context (depending on window). The downside is fixed latency (you need future frames for center). DIP's sliding window deployment with an overlap is a variant of this: they might run the bi-RNN on a window of, say, 1 second, and output the first half of that window, then slide.

**Transformers:** As exemplified by TIP, Transformers use self-attention to model long-range dependencies in a sequence of sensor data. A Transformer's encoder could take the entire sequence of IMU readings (or a large chunk) and globally attend to any time step when predicting the pose at a certain time.

TIP specifically used a Transformer decoder framework where the model iteratively produces poses auto-regressively, attending to past outputs and inputs. This allows flexible context length – potentially the model can consider very long histories (hundreds of frames) if needed, which might capture slow drift trends or repetitive cycles.

Transformers often require more data to train (due to their many parameters and lack of built-in inductive bias for smooth temporal progression like RNNs have) and careful positional encoding (to know the order of frames). TIP overcame data scarcity by training on both synthetic and real sequences and incorporating the history via the decoder mechanism.

The benefit of Transformers is the rich modeling of complex time relationships (they might learn, for example, that a certain IMU pattern 10 seconds ago combined with the current pattern implies something now – an RNN with limited memory might forget). The cost is computational: self-attention is :math:`O(T^2)` for sequence length :math:`T`, which can be an issue for long sequences or high-frequency data unless windowed attention is used. TIP managed real-time, suggesting they either constrained :math:`T` or optimized the model well.

In summary, RNNs (especially bi-directional or stacked) have been the workhorse for temporal modeling in early works (DIP, TransPose), whereas Transformers are emerging for their strength in capturing long-range constraints (TIP). Some methods might also combine approaches; e.g., TransPose uses an RNN for translation estimation but a feed-forward for pose after a hierarchical pass.

Supervised vs. Semi-Supervised Training; Synthetic Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training a learning-based model for this task can be fully supervised if we have ground-truth pose sequences for our IMU inputs. As discussed, obtaining ground-truth usually means using an expensive system (like an optical mocap with markers or a dense IMU suit or both) to record a few subjects, or using simulated data.

DIP and successors have leaned heavily on synthetic training data generated from mocap databases like AMASS. The process involves taking poses from these databases and computing what a set of IMUs would measure if the person was doing that pose sequence. Since the pose sequence is known (from mocap), this automatically provides the supervision: the algorithm is trained to map the synthetic IMU readings back to the original known pose.

This approach vastly increases training sample size and motion diversity – e.g., AMASS aggregates 18 motion datasets, over 40 hours of motion, including complex actions. The challenge is domain mismatch: simulated IMU readings may differ from real IMUs. Simulated data assumes a perfect sensor (no noise, no bias drift, exact alignment to body). Real IMUs have sensor noise, calibration offsets, magnetic disturbances, etc.

To bridge this gap, methods do a few things:
1. Add realistic noise to synthetic IMU signals during training (e.g., Gaussian noise on orientations, drift perturbations) to teach the network to be robust
2. Use domain adaptation or fine-tuning on a smaller set of real data

DIP, for example, fine-tuned their model on the DIP-IMU real dataset for a few epochs to adapt to real sensor characteristics. This significantly improved accuracy on real test motions compared to using the pure synthetic-trained model.

Semi-supervised or self-supervised training becomes useful if we have a lot of real IMU sequences without ground-truth poses. One can incorporate those by using losses that do not require ground truth pose. For instance, one can train a model with a mix of supervised loss (on synthetic or the few labeled real) and unsupervised consistency loss on unlabeled real.

A common unsupervised loss is the reconstruction loss of sensor signals: we mentioned above, ensure the predicted pose re-generates the IMU readings. If a network is good, feeding its output pose back into the forward model should produce the same orientation/accel that was input. By enforcing this on real data, the network learns from real sensor patterns even without knowing the exact pose – essentially it's learning to solve the inverse problem by trying to be a consistent inverse of the forward physics.

Physics-based constraints can also help self-supervise: for example, an unlabeled IMU sequence might have obvious periods of rest – the network could be trained to recognize and enforce zero-velocity on predicted motion during those periods (since if accelerometers read ~9.81 m/s² with little variation, likely the person is static; the network should output a static pose).

Some recent research explores Physics-Informed Neural Networks (PINNs) for IMU pose, where the loss includes physical equations (like equations of motion) on unlabeled sequences.

Another approach is synthetic fine-tuning: using simulated but in-the-loop refinement. For instance, one could train a network on synthetic data, then deploy it on some real data and use the agreement between forward-simulated sensors and actual sensors as a cue to update the model (a form of self-calibration).

In practice, DIP, TransPose, TIP all use predominantly supervised learning with synthetic data plus limited real data fine-tune. PIP's first stage (kinematics network) was trained on synthetic data (likely similar to DIP).

What made these works succeed is the realization that AMASS + simulated IMUs provides effectively infinite training data of diverse actions. The AMASS dataset (released 2019) was a boon: DIP (2018) had to collect smaller mocap sets themselves; later works could directly draw from AMASS's 343 motion subjects and huge action variety.

Of course, careful alignment of coordinate frames between simulation and device is needed (e.g., aligning the virtual world frame's gravity to match IMU convention that z-axis points opposite gravity).

In summary, supervised learning on synthetic data is the current standard approach, with fine-tuning on real data to handle domain shift. Semi-supervised ideas hold promise to further leverage unannotated real IMU data by imposing that the model's outputs obey physical measurement laws and consistency. This could reduce reliance on expensive motion capture labeling in the future.

Model Architectures and Design Considerations
------------------------------------------------

Encoding IMU Measurements
^^^^^^^^^^^^^^^^^^^^^^^^^

The first step in the model is to convert raw IMU streams into a suitable ML input. Each IMU's orientation can be represented in several ways:

**Quaternions (4D):** A straightforward encoding of orientation is a quaternion :math:`(w, x, y, z)` representing the rotation from some reference frame to the sensor frame. Quaternions need to be normalized (unit length). Networks can handle this, but sometimes learning to maintain normalization is tricky. Often quaternions are fed as is, and one might enforce normalization either by an explicit layer or by normalizing the network's output if it ever predicts orientations.

**Rotation matrices (9D):** We can also feed the 3×3 rotation matrix elements (9 numbers). This is redundant (only 3 DoF are true degrees of freedom) and also has orthonormal constraints. But in practice some have used 9D with a normalization step.

**6D continuous representation:** Zhou et al. (2019) proposed representing rotation by two 3D vectors, corresponding to the first two columns of the rotation matrix, which are then normalized and the third is their cross product. This 6D representation has no singularities and networks can output any 6 numbers which will map to a valid rotation. It's common in vision tasks and could be applied here too. Some recent IMU works likely use it internally for output angles; for input, since IMU orientation is known exactly, one might not bother converting it – quaternion is fine.

**Euler angles (3D):** If one picks a consistent e.g., ZXY Euler angle convention, they could use 3 numbers from the IMU orientation (pitch, roll, yaw). However, Euler angles have discontinuities (gimbal lock, angle wrapping). It's generally avoided as a direct learning input to a neural net in favor of quaternions or 6D.

For accelerations, typically the linear acceleration vector (with gravity subtracted or not) in the global frame is used. One subtlety: IMUs often give acceleration in the sensor's local frame. To express it in a fixed world frame (like SMPL's global coordinates), one multiplies by the orientation.

Many methods transform accelerations to a body-centric frame instead for learning. For example, DIP normalized the inputs by rotating all sensor orientations and accelerations into the root IMU's frame or a reference frame tied to the body. This removes global orientation as a factor – the network doesn't have to learn invariance to the person facing north vs east; it only cares about pose relative to whatever direction the root is facing.

In practice, one can take the pelvis IMU orientation :math:`R_{\text{pelvis}}` and use it to invert-transform all other orientations: :math:`\tilde{R}_i = R_{\text{pelvis}}^{-1} R_i`, so that pelvis becomes identity orientation. Similarly transform accelerations to pelvis frame. This way the network mostly sees relative orientations of limbs to pelvis (which correlates directly to joint angles) and the pelvis orientation itself (which indicates global heading).

DIP mentions careful treatment of coordinate systems as a crucial step. Failing to do this can cause the network to struggle – e.g., it would have to learn that an IMU orientation of (0.7, 0, 0.7, 0) for the pelvis might correspond to the same pose as (1,0,0,0) just rotated in yaw.

Another encoding is to break the acceleration into gravity + linear. An IMU actually measures :math:`acc_{\text{measured}} = -g \hat{z}_{\text{world to IMU}} + acc_{\text{linear}}`. Some networks feed the acceleration as two parts: the gravity vector (which is basically the orientation's down-axis) and the short-term linear acceleration (due to motion). But usually the orientation already contains gravity direction information, so it may be redundant.

Additionally, one can include angular velocity (gyroscope) readings if available. In principle, IMUs give orientation (from an internal filter) and also raw gyro which indicates rotation speed. Most methods focus on orientation+acc only, since orientation is essentially integrated gyro and is less noisy. But including angular velocity might add high-frequency motion clues.

Few papers explicitly mention using gyro or magnetometer; DIP did not use magnetometer (which provides compass heading) explicitly, though the orientation likely came from a fusion that uses it.

Finally, each time step's multi-IMU data is often flattened and concatenated to feed into an RNN or transformer. For an RNN, one can also feed them in a structured way (like processing each IMU with a sub-network then merging).

DIP's network design included an initial per-sensor fully-connected layer to embed each sensor's data to a feature vector, then concatenating those. This is analogous to treating each IMU as a "channel." Some architectures could also use convolution across sensors (if sensors were ordered in a kinematic order). However, typically a simple concatenation suffices since the number of sensors is small and fixed.

Network Structures
^^^^^^^^^^^^^^^^^^^^^^

**Feed-forward vs. recurrent vs. attention:** We discussed temporal modeling; here we consider the overall network topology. DIP's architecture was a stacked bi-directional GRU network. It had multiple layers of GRU cells – the output of one GRU goes into the next GRU (this allows higher-level features to be extracted as you go up layers).

The bi-directional nature means one GRU layer processes from frame1→frameT, another processes from frameT→frame1, and their outputs are concatenated. On top of the last GRU, a final fully-connected layer maps to the pose output.

The DIP network was not extremely deep (maybe 2 layers of GRU with a few hundred hidden units each), but it was enough to capture the temporal patterns.

TransPose's pose network was multi-stage but each stage might be smaller (perhaps a couple of dense layers). Its translation network was an RNN (probably an LSTM) that integrated velocity. TIP's network is a Transformer: presumably an encoder that takes the IMU sequence and a decoder that outputs pose sequence auto-regressively. That means the architecture had the typical transformer blocks (multi-head self-attention, feed-forward sublayers). The query of the decoder at each time could be the previous pose plus positional info, and the decoder cross-attends to the IMU encoder output. This is a bit complex architecture but conceptually it replaces the recurrence with self-attention.

A common pattern is the use of dense intermediate layers to expand or reduce feature size. For example, DIP likely had a dense embedding of each time step's sensor data to, say, 128D, then the GRU's hidden state maybe 256D, etc., and a final output layer. These fully connected layers can be seen as the network learning an appropriate representation of the sensor signals (for instance, computing relative orientations or angles).

One interesting component in TransPose is the leaf joint prediction. This was implemented as an intermediate supervised output of joint positions. We can interpret it as a form of hierarchical decoding: the network might first output 3D positions of hands, feet, head. Then in a next layer (or next stage of the network), it uses those and the IMU data to predict the next set (like elbows, knees, etc.), ending with the root.

This hierarchical approach ensures that end-effectors (which are directly influenced by IMU on that segment) are correct first, then internal joints (which are more ambiguous from IMUs alone) can be inferred given where the limbs ended up. Implementing this requires either a multi-output architecture or sequential modules.

The TransPose authors describe it as a "multi-stage network", which implies they had perhaps one sub-network focusing on leaves, then fed its results into another sub-network for the rest. They likely supervised the intermediate predictions with the ground-truth joint positions (a kind of deep supervision to guide each stage).

Most networks output the pose for each frame independently given the context. However, one can include feedback loops: TIP feeds back previous outputs explicitly in the decoder, and even DIP's RNN effectively feeds back its hidden state which contains info about previous outputs.

**Output decoding:** Finally, how does the network output map to a valid pose? If the network outputs :math:`3K` numbers as joint rotations in axis-angle, that is taken as is (the network must learn to output reasonable values, typically it will because it's trained to minimize rotation error).

Sometimes a network might output a rotation in a form that isn't guaranteed valid (say it outputs 3 numbers intended as axis-angle but those can represent an angle beyond :math:`\pi` which might be interpreted differently). Usually this is handled by the loss rather than by constraints on output. If a network outputs quaternions for joints, one would normalize them. If it outputs a rotation matrix (like some frameworks do via 9D rep), one would use a differentiable orthonormalization.

TransPose's approach of outputting joint positions means they needed an extra step to retrieve joint angles. Possibly they solved an inverse kinematics (IK) problem: given predicted positions of hands, feet, etc., find joint angles that place those correctly. They might do this via a least-squares solve or incorporate a differentiable IK layer. But since they said it's multi-stage network (which is within the learning model), they might have implicitly learned to produce rotations that match those positions by design. It's a bit unclear, but likely they had differentiable kinematics in the loop.

**Global translation output:** If predicting root translation, one simple method is to predict the root's velocity at each frame (in the horizontal plane and vertical). TransPose did a combination of foot heuristic and RNN for this. TIP predicted stationary points which indirectly gives translation once you align those stationary points over time.

A network could also output the root position directly, but it might be better to predict velocity to avoid unbounded error (since velocity can be integrated and networks are good at local increments). In training, however, having ground-truth positions allows direct supervision.

**Network size considerations:** DIP's network was small enough to run on a CPU in real-time (though GPU was used for training). TIP's transformer was likely larger, but they may have optimized sequence length (real-time demo indicates it was feasible). PIP's network purposely was kept small to keep the 16ms latency – because after the network, they still do a physics solve.

PIP's network might be a single LSTM layer or even a feed-forward that looks at a short window of IMU data to output the current pose (since the physics will correct any long-term issues). Indeed, PIP's paper mentions the kinematics module regresses "motion state" – possibly meaning pose plus velocities – which a simple network can do in one shot.

In general, designing the network architecture involves deciding: how much temporal history to use, how to encode the input features (taking care of coordinate frames), whether to incorporate any physics priors inside (some have tried to hardcode gravity subtraction or known limb lengths, etc. in the network itself), and how to ensure outputs are valid (often by choosing appropriate representation or by adding losses that keep them in check).

Training Pipeline and Pseudocode
--------------------------------

Bringing all the pieces together, we outline a basic pipeline for training a learning-based IMU-to-pose model:

**Data Preparation:**

1. Collect or generate a set of training sequences. For example, use AMASS to get many pose sequences :math:`{\boldsymbol{\theta}}_{1:T}`.

2. For each sequence, simulate IMU readings. For each time frame, compute sensor orientations and accelerations from the pose. Add noise if desired. Store pairs of :math:`(\mathbf{X}_{1:T}, \mathbf{Y}_{1:T})` where :math:`\mathbf{Y}_{1:T}` are the ground-truth SMPL poses (and optionally root translations).

3. Split into training, validation sets. Also prepare any real sequences for fine-tuning or testing (e.g., DIP-IMU, TotalCapture sequences with ground truth).

**Model Initialization:**

1. Design the neural network model :math:`f_\Theta`. Choose RNN/Transformer etc., set input size = :math:`M` sensors * (orientation+accel dim), output size = pose (72) + maybe root vel/pos (3).

2. Initialize weights :math:`\Theta` (random or Xavier initialization, etc.). If using a two-stage (like PIP), initialize both the kinematics network and prepare the physics module parameters (which might not have trainable weights if it's analytic).

**Training Loop (Supervised):**

.. code-block:: python

    for epoch in range(num_epochs):
        for each batch of sequences in training_data:
            # Prepare batch data
            X_batch, Y_batch = batch_sequences()  # shape [batch, T, features] and [batch, T, output_dim]
            
            # Forward pass: 
            Y_pred = model(X_batch)  # run the network on each sequence
            
            # Compute loss:
            loss_pose = MSE(Y_pred['pose'], Y_batch['pose'])
            loss = loss_pose
            if predict_root:
                loss_trans = MSE(Y_pred['root_pos'], Y_batch['root_pos'])
                loss += alpha * loss_trans
            # (Add other losses if defined, e.g., velocity, contact)
            
            # Backpropagation:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # (Optional) Validate on val set, adjust learning rate, etc.

This pseudocode depicts a simple supervised training. In practice, sequence batches can be handled by unrolling RNNs or by padding sequences to a common length. The loss here is just MSE (mean squared error) as a placeholder for potentially the sum of all relevant terms (orientation differences, etc.). alpha is a weight for translation loss.

**Fine-tuning / Domain Adaptation (if applicable):** After initial training on synthetic data, one can fine-tune on a smaller real dataset:

.. code-block:: python

    for epoch in range(few_epochs):
        for each batch in real_data:
            X_real, Y_real = batch_real()
            Y_pred = model(X_real)
            loss = MSE(Y_pred, Y_real)  # real ground truth from DIP-IMU or TotalCapture
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

This step uses a smaller learning rate to not forget the general learned motion prior, but adapt to real sensor noise and biases.

**Inference (Online Deployment):** Once trained, the model can be used on new IMU data in real time:

.. code-block:: python

    # Initialize hidden state for RNN if needed
    h = None  
    for each new frame t:
        x_t = read_IMUs()          # get current IMU orientations & accels
        x_t = normalize_frame(x_t) # apply coordinate transforms, normalization
        y_t, h = model.predict(x_t, h)  # RNN: provide previous hidden state
        # y_t contains pose (and possibly root translation)
        visualize_pose(y_t)        # or send to application

In a transformer model deployed auto-regressively, the prediction might be done one step at a time as well, feeding back the last few poses as context. In an RNN, the hidden state h carries the needed history compactly. The normalize_frame would, for example, rotate the input so that the pelvis IMU orientation is identity (using an initial calibration stance as reference perhaps).

**(Optional) Physics Correction at Inference:** If using a hybrid like PIP, after obtaining a window of poses (say last :math:`N` frames from the network), one would run a physics solver:

.. code-block:: python

    pose_sequence = get_last_N_predicted_poses()
    pose_sequence_corrected = physics_optimize(pose_sequence, IMU_measurements_last_N)
    output(pose_sequence_corrected[-1])

This optimization typically adjusts the sequence to better match the recent measurements and physical laws, and returns the corrected current pose.

During training, monitoring metrics like orientation error (in degrees) per joint or position error (in mm) for key joints can guide development. A common metric in literature is the Mean Per Joint Position Error (MPJPE) in millimeters, sometimes after aligning the root.

On DIP-IMU and TotalCapture, for instance, methods report MPJPE to compare accuracy. The pseudocode above glosses over details like batching variable-length sequences (often handled by packing in PyTorch LSTMs) and assumes fully supervised training. In a semi-supervised scenario, one might have an inner loop where for unlabeled data you only use reconstruction loss: e.g., train to minimize difference between input IMU orientation and orientation derived from predicted pose.

Datasets, Benchmarks, and Resources
-----------------------------------

To develop and evaluate IMU-to-pose methods, several datasets and benchmarks are used:

DIP-IMU Dataset (2018)
^^^^^^^^^^^^^^^^^^^^^^^^^^

Introduced with Deep Inertial Poser, DIP-IMU is one of the first large-scale datasets for sparse IMU human motion capture. It contains 10 subjects (9 male, 1 female) each performing a variety of motions in 5 categories (walking, running, sports, etc.), recorded with a full Xsens suit of 17 IMUs at 60 Hz.

In total it has 64 sequences comprising about 330,000 time instants of data. Each frame has the 3D orientation (quaternion) of each IMU (from Xsens's on-board EKF) and the raw accelerometer readings.

Ground-truth poses (3D joint angles) were obtained via a high-end optical motion capture system simultaneously recorded, and then post-processed to fit the SMPL model. DIP-IMU is publicly available for research – the project page provides a download link after registration.

This dataset has become a standard benchmark: SIP's optimization was tested on it, and DIP used it for validation. It enables researchers to test their algorithms on common motions and compare error metrics (e.g., mean joint error in degrees or centimeters).

The DIP-IMU dataset's size (over 300k frames) also made it suitable for training deep networks, and indeed DIP and subsequent methods (TransPose, etc.) train on a mix of DIP-IMU and synthetic data.

(URL: http://dip.is.tue.mpg.de)

TotalCapture (2017)
^^^^^^^^^^^^^^^^^^^^^^^

The TotalCapture dataset, released by Trumble et al., is a multimodal motion capture dataset that includes synchronized video, IMU, and Vicon marker data for human motions. It features 5 subjects (4 male, 1 female) each performing four distinct sets of actions (ROM exercises, walking, acting, and freestyle) with each sequence repeated 3 times.

The dataset provides data at 60 Hz from 8 calibrated cameras and 120 Hz data from a set of 13 IMUs (the subjects likely wore a Motion Analysis or Xsens set covering most body segments). With nearly 1.9 million frames of synchronized data, TotalCapture was the first dataset to offer IMU data aligned with ground-truth 3D poses (from a Vicon optical system) on such a large scale.

Researchers have used TotalCapture to evaluate IMU-only pose estimation as well as fusion of video and IMUs. For instance, the SIP paper compared their 6-IMU method against a baseline on TotalCapture sequences. The DIP authors also utilized TotalCapture by fitting SMPL to the Vicon data to create reference poses for testing their network.

The dataset can be obtained by request from the University of Surrey's website (it requires a signup due to its size and to agree to usage terms). TotalCapture provides diverse indoor motions and challenging freeform activities (like acting out scenarios) which test an algorithm's generalization.

It is particularly useful for methods that combine visual and inertial data, but also for pure-inertial methods that can use the IMU streams and the "ground truth" SMPL pose fits provided by DIP authors for quantitative evaluation.

(URL: https://cvssp.org/data/totalcapture/)

AMASS (2019)
^^^^^^^^^^^^^^^^

The Archive of Motion Capture as Surface Shapes (AMASS) is a large collection of mocap datasets unified into a common format of human model parameters. AMASS gathers 15 different mocap datasets (recorded with marker-based systems) and fits the SMPL (and SMPL+H for hands) model to all of them using the MoSh++ algorithm.

The result is a dataset of 11,000+ motions, over 40 hours of data from more than 300 subjects, all represented consistently as sequences of SMPL pose and shape parameters.

AMASS does not contain IMU data per se, but it has been hugely beneficial for IMU research because one can simulate IMU measurements from the AMASS motions. For example, given a sequence of SMPL poses from AMASS, one can compute the orientations and accelerations of virtual IMUs placed on the SMPL body (this is exactly what DIP and others do to generate training data).

The TransPose repository's preprocessing script demonstrates this: it takes AMASS sequences, assumes 6 IMU placement as in SIP, and computes "synthetic" IMU sensor data (orientation and acceleration) for each sequence.

This synthetic data can be used to train networks so that they don't overfit to the specific motions of DIP-IMU or TotalCapture, and it covers a far broader range of movements (since AMASS includes data from CMU MoCap, Human3.6M, gait datasets, etc.).

AMASS is accessible for research; users must register on the AMASS website and can then download the parameter files for the various sub-datasets. By using AMASS, one can also compute pose priors – many optimization methods (including SIP) leverage the fact that AMASS provides a distribution of typical human poses and shapes.

In summary, while AMASS is not an IMU dataset in itself, it is an invaluable resource to generate data for algorithm development and to serve as a prior on human motion.

(URL: https://amass.is.tue.mpg.de)

Other Datasets and Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A number of other datasets and tools are worth mentioning:

- The TNT15 Dataset referenced in the SIP paper refers to a motion capture dataset from Tübingen (likely containing various motions) that was used as a baseline; it may be available through MPI.
- More recently, researchers have begun collecting IMU data in outdoor or clinical settings: for example, IMUPoser (CHI 2023) used smartphone IMUs to capture daily activities, and MM-fit dataset provides IMUs for workout motions.
- The KIT Motion Dataset and MPI Limitations Dataset have also been used for evaluating how methods handle extreme or unusual motions.

On the software side, many open-source implementations for IMU pose estimation exist:
- The original SIP code (in C++) was available as a research prototype, and there are re-implementations like the one by Yan et al. in the Fusion Pose project.
- The DIP authors released their PyTorch code and pretrained model on the DIP project page, allowing researchers to directly use the DIP network for comparison.
- The TransPose project (2021) provides code that includes not only a neural network but also an example of using a solver (Ceres) to refine global pose with IMUs, giving a practical example of combining learning with optimization.

When developing an optimization-based method, one can use these datasets to test: for instance, start by fitting a single frame's pose to one frame of DIP-IMU orientation data (which should be similar to solving a PSO problem per frame), then extend to a window of frames to incorporate acceleration.

By leveraging these datasets, the community has established evaluation protocols – common metrics include MPJPE (Mean Per Joint Position Error) in millimeters between the estimated pose and ground-truth pose, as well as orientation errors in degrees for each joint.

For example, on DIP-IMU, DIP (the neural method) reports an average joint position error around 60~80mm, whereas SIP (optimization) achieves around 100mm on certain motions, illustrating the trade-off between real-time inference and global accuracy.

Datasets like TotalCapture allow comparisons against vision-based methods: e.g., combining IMUs with video can reduce errors compared to video-only, especially for occluded limbs. Overall, the availability of data and code has greatly accelerated research, enabling more advanced techniques like hybrid model-learning methods and domain adaptation for different sensor configurations.

Challenges and Outlook
--------------------------

Despite significant progress, several challenges remain in fitting SMPL to IMU data, and ongoing research is exploring solutions:

**Generalization to Unseen Motions:** A model trained on certain activities might struggle with very different ones. For instance, a network trained mostly on locomotion (walking, running) might have difficulty with acrobatic moves or crawling, which involve unusual sensor readings.

Ensuring motion diversity in training data is key (hence using AMASS). The model also needs capacity to represent very different poses (from standing upright to upside-down poses). Future work could involve adaptive models or mixture-of-experts that handle different motion regimes, or continued pre-training on enormous motion datasets to be more universal.

**Subject Generalization and Shape:** Most learning approaches assume a generic or average body shape. If a person is very tall or short or has different limb lengths, the same IMU orientation might yield different joint angles (because limb length differences alter the relation of sensor orientation to posture).

Optimization approaches inherently handle individual anthropometry (by calibrating the model's shape :math:`\boldsymbol{\beta}`). For learning methods, one could feed body shape as an additional input to the network (if one could estimate it from, say, the distance between some sensors).

Alternatively, one can calibrate shape by a short sequence: e.g., have the person do a T-pose or some known motion, and optimize :math:`\boldsymbol{\beta}` so that the network's output fits the sensor data of that known pose.

Research in joint shape-pose estimation from IMUs is nascent – some works use foot-to-hip distance observed during walking to infer leg length. Future systems might include a calibration phase where the network quickly estimates the user's body shape or even fine-tunes to the user's data (which crosses into personalizing the model).

**Drift and Cumulative Error:** Without external references, global position estimates will always have some drift. TransPose's foot locking and TIP's stationary point approach greatly mitigate drift, but over very long periods (minutes of continuous movement), small errors can still accumulate.

One outlook is to integrate occasional zero-velocity updates or similar to reset drift – in pedestrian tracking, it's known that if a person stops occasionally, one can reset velocity to zero. IMUs also typically have magnetometers for compass heading, which could be used to prevent heading drift (most methods didn't explicitly use magnetometer, but it's a piece of info that could stabilize global orientation over long times).

If environmental feedback can be obtained (e.g., recognizing the person returned to their start point), that could correct drift. In absence of any external signal, physics-based methods like PIP ensure physical consistency but cannot know global location drift if the entire motion floats in one direction slightly off – they would need some assumption (like flat ground and eventually you must come to rest).

**Real-Time and Low-Power Implementation:** For truly wearable systems (say an app on a smartphone reading IMUs from wearable sensors), computational efficiency is crucial. Models like DIP and TransPose are light enough for a modern smartphone GPU. Transformers might be heavier; one can use quantization or simpler architecture for deployment.

Some research (as mentioned with Fast Inertial Poser) is looking at pruning models to run on microcontrollers in the IMU units themselves. Balancing model complexity with latency and battery life will be important for practical systems (e.g., VR suits, sports analytics devices).

**Robustness to Sensor Errors:** IMUs can produce faulty data: magnetic distortion can throw off orientation, accelerometers saturate on impact, sensors may disconnect briefly. A robust pose estimator should handle missing data or outliers.

This can be addressed by filtering the input (e.g., smoothing the IMU signals, or using sensor fusion outputs that are already filtered). Neural nets could also learn to infer a missing sensor's information from others (e.g., if one sensor fails, perhaps the network can still guess pose from the remaining).

Designing redundancy or using an odd number of sensors (say 7th sensor as a redundant one that the network uses only if needed) could be explored.

**Multi-person and Interaction:** All methods discussed assume one person's IMUs, independent of others. In scenarios with multiple people each wearing IMUs, there's the potential to confuse signals if not properly labeled.

Also, physical interaction between people (like two people hugging or carrying each other) poses new challenges – the IMUs don't directly tell if an external force/contact from another person is happening, which could lead to impossible poses if solved individually.

This suggests future directions in joint pose estimation from multiple subjects' IMUs, possibly using a combined physics model (like simulating two bodies that can exert forces on each other and fitting that to IMUs).

**Combining Learning with Physics (Outlook):** PIP has shown one way to combine them (serially). Another outlook is to integrate physics within the learning process. For example, one could have a differentiable physics engine (there are libraries for that) and train the neural network's outputs to not just match pose ground truth but also to minimize physical violations.

This would inject physics knowledge into the network during training itself, potentially resulting in a model that at runtime naturally outputs physically plausible motions without needing a second stage.

One could also learn the physics parameters (like ground friction, or personalized mass distribution) in a parallel stream. We might see Physics-informed Neural Networks (PINNs) for human motion, where the loss function includes terms for Euler-Lagrange equations or momentum conservation.

Already, a self-supervised PINN was proposed for IMU pose that estimates dynamics without ground truth forces.

**Integration with Other Sensors:** While pure IMU is appealing for its independence, in practice there's growing interest in hybrid systems. For example, a system might use a smartwatch IMU plus occasional camera images from a phone to reduce drift, or use ultra-wideband (UWB) radio beacons for positional references while IMUs do the pose.

Such combinations could be handled by learning (e.g., an RNN that takes both IMU and UWB ranges to output global position more accurately). From an outlook perspective, IMU-based pose is likely to be a component in larger VR/AR systems where additional cues (like foot pressure insoles, etc.) are available. Each additional modality can be fused via learning.

**Better Losses and Uncertainty:** The heteroscedastic uncertainty approach DIP used could be extended: models could output full probability distributions over ambiguous joint angles. For instance, imagine an IMU on the back cannot tell if the person's arms are raised or down (if arms had no sensors) – a model might output two modes.

Currently, networks typically output a single best guess. In the future, probabilistic pose outputs (like a Gaussian Mixture Model over pose space) could be used, which a downstream application or optimizer might refine with additional clues.

This connects to the idea of Bayesian deep learning for motion capture, providing not just an estimate but a confidence (which could be crucial for, say, alerting if the system is unsure of the pose).

Conclusion
--------------

Fitting the SMPL model to IMU data via learning-based methods has evolved rapidly from early feasibility (DIP) to sophisticated systems (TransPose, TIP, PIP) that address many limitations. Learning-based methods excel at real-time performance and leveraging motion data to fill in the gaps left by sparse sensors.

By integrating ideas from sequential modeling (RNNs, Transformers) and combining with physical reasoning (contact constraints, dynamics), current systems achieve impressive accuracy – often within a few centimeters error for key joints.

The remaining issues of drift and unusual motions are being tackled with creative hybrids and larger training sets.

The outlook is that wearable motion capture will become increasingly accurate and practical, with perhaps just a handful of IMUs needed to drive high-fidelity human avatars in real time.

The synergy of learning and physics is a promising avenue: we can expect future research to focus on end-to-end differentiable physics-informed networks that learn from both data and physical laws to achieve robust IMU-based human tracking in any environment.

The ultimate goal is a system that you can strap on a few sensors and forget about – it will just reliably translate your movements into a virtual body, whether you're walking, dancing, or climbing a wall, all without cameras.

The progress reviewed in this chapter suggests that goal is on the horizon, making IMU-based pose estimation a thrilling area of ongoing research for students and experts alike.