Towards Robust and Expressive Whole-body Human Pose and Shape Estimation

Introduction

Given a single RGB image, 3D human pose and shape estimation aims to reconstruct human body meshes with the help of statistical models. It has gained widespread attention owing to its extensive applications in various fields, including robotics, computer graphics, and augmented/virtual reality. SMPL [1], MANO [2], and FLAME [3] are popular statistical models for individually reconstructing the human body, hands, and face. Recently, there has been growing interest in whole-body estimation, which jointly estimates the body pose, hand gestures, and facial expression of the entire person from the input image. Compared to previous work that only learns body pose and shape parameters, expressive human pose and shape estimation models based on SMPL-X [4] capture detailed facial expressions and hand gestures by learning additional hand and face poses as well as facial expression parameters.

Robustness study of existing models

To better understand the strengths and weaknesses of current state-of-the-art whole-body human pose and shape estimation models, we conducted a robustness study involving ten controlled augmentations across three categories (Figure 1; a code sketch follows the list):

  1. Image-Variant Augmentations affect the visual quality of the image without altering the subject’s 3D pose or position, such as color adjustments, contrast, sharpness, and brightness.
  2. Location-Variant Augmentations shift the subject’s position within the image without changing its pose, including movements like translation and scaling.
  3. Pose-Variant Augmentations simultaneously adjust the 3D pose and location of the subject, such as rotations.
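
As a rough illustration (not the exact protocol of our study), the sketch below shows how each category of augmentation could be generated with torchvision; the specific transformations and parameter ranges are assumptions:

```python
# Illustrative sketch of the three augmentation categories; the actual
# transformations and parameter ranges used in the study may differ.
import torchvision.transforms.functional as TF

def image_variant(img, brightness=1.2, contrast=1.1, sharpness=1.3):
    # Alters visual quality only; the subject's 3D pose and position are untouched.
    img = TF.adjust_brightness(img, brightness)
    img = TF.adjust_contrast(img, contrast)
    return TF.adjust_sharpness(img, sharpness)

def location_variant(img, dx=20, dy=0, scale=0.9):
    # Shifts and rescales the subject in the image; its 3D pose is unchanged.
    return TF.affine(img, angle=0.0, translate=[dx, dy], scale=scale, shear=0.0)

def pose_variant(img, angle=15.0):
    # An in-plane rotation changes both the subject's 3D pose (global
    # orientation) and its location, so the ground truth must be rotated too.
    return TF.rotate(img, angle)
```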

Our findings revealed that while these models are generally good at handling image-variant augmentations, they struggle with location-variant augmentations. As demonstrated in Figure 2, small shifts in alignment (left) and scale (right) lead to substantially higher errors, indicating a high sensitivity to positional changes.

The sensitivity of pose and shape estimation models to the subject’s location in the image has several implications:

  • Imperfect Crops: In real-world applications, third-party detectors or pose estimation models are employed to locate the person, but these tools are not always accurate, leading to imperfect crops.
  • Complications in Whole-body Estimation: The challenge is exacerbated in whole-body estimation pipelines. Even if the body crops are perfect, the hand and face crops fed to their respective networks are often imperfect, affecting the final outputs.

Our solution: RoboSMPLX

Our robustness study also highlights certain limitations of existing pose and shape estimation models. First, the high sensitivity to the subject’s location in the image indicates that the models have difficulty localizing the subject. Second, the deterioration of performance under such variations suggests that the models struggle to extract meaningful features: after a change in translation or scale, the subject remains within the image frame, even though the proportion of background content may vary, yet existing methods fail to disregard irrelevant background elements and extract the features relevant to the subject of interest. Third, in certain instances, the models fail to produce properly aligned results despite precise subject localization.

To tackle each of the above-mentioned problems, we developed RoboSMPLX to enhance the robustness of whole-body pose and shape estimation through three specialized components (Figure 3):

  1. Localization Module helps to obtain an accurate localization of the subject. While simpler networks directly predict the parameters from backbone features, this module implements sparse (2D keypoints) and dense (2D part segmentation maps) prediction branches to ensure the model is aware of the location and semantics of the subject’s parts in the image; the learned joint positions are helpful in recovering the relative rotations (see the sketch after this list).
  2. Contrastive Feature Extraction Module encourages the model to produce consistent features irrespective of the applied augmentation, improving its generalization ability and robustness to a broader range of real-world scenarios. This module incorporates a pose- and shape-aware contrastive loss together with augmented positive samples. By minimizing the contrastive loss, the model is encouraged to extract meaningful invariant representations for the same subject, even when presented with different augmentations, making it robust to various transformations (see the loss sketch after this list).
  3. Pixel Alignment Module learns more accurate pose, shape, and camera parameters. This component uses differentiable rendering to ensure a more precise pixel alignment of the projected mesh.

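For concreteness, here is a minimal sketch of the first two components, assuming a generic CNN backbone; the layer sizes, joint/part counts, and the exact form of the pose- and shape-aware contrastive loss are assumptions and differ from the actual RoboSMPLX implementation (the differentiable-rendering step is omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationHead(nn.Module):
    """Sparse/dense prediction branches: 2D keypoint heatmaps and 2D part
    segmentation maps predicted from backbone features, so the regressor is
    aware of where the subject's parts are in the image."""
    def __init__(self, in_ch=256, num_joints=22, num_parts=14):
        super().__init__()
        self.keypoint_branch = nn.Conv2d(in_ch, num_joints, kernel_size=1)
        self.segment_branch = nn.Conv2d(in_ch, num_parts + 1, kernel_size=1)  # +1 for background

    def forward(self, feats):                    # feats: (B, C, H, W)
        heatmaps = self.keypoint_branch(feats)   # (B, J, H, W)  sparse branch
        part_maps = self.segment_branch(feats)   # (B, P+1, H, W) dense branch
        return heatmaps, part_maps

def contrastive_feature_loss(feat_orig, feat_aug, temperature=0.1):
    """InfoNCE-style loss: features of the same subject under different
    augmentations (positives) are pulled together, while features of other
    subjects in the batch act as negatives."""
    z1 = F.normalize(feat_orig, dim=1)                 # (B, D)
    z2 = F.normalize(feat_aug, dim=1)                  # (B, D)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Usage sketch (names are hypothetical):
#   feat_orig = backbone(images)
#   feat_aug  = backbone(location_variant(images))
#   loss = contrastive_feature_loss(feat_orig, feat_aug)
```
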
Results and Conclusion

In a nutshell, RoboSMPLX achieves more accurate subject localization, robust feature extraction, and pixel alignment. In our paper, we present quantitative and qualitative benchmarks for our body, hand, face, and whole-body models. Our model produces more consistent results and fewer errors under various location-variant augmentations. For more details, please refer to our NeurIPS paper.

Paper link: https://arxiv.org/abs/2312.08730

References

[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.

[2] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), November 2017.

[3] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. URL https://doi.org/10.1145/3130800.3130813.

[4] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10967–10977, 2019. doi: 10.1109/CVPR.2019.01123.
