Enhancing Hands in 3D Whole-Body Pose Estimation
with Conditional Hands Modulator

CVPR 2026

What is Hand4Whole++?

  • High-Precision Hands in Whole-Body Framework: A modular 3D whole-body pose estimation framework that significantly surpasses previous methods in hand accuracy.
  • Single-Image Inference: Efficiently recovers full-body mesh and poses from a single RGB image without requiring complex multi-view setups.
  • SMPL-X Standard Output: Directly outputs expressive SMPL-X parameters, including body, hands, and face, ensuring compatibility with standard graphics and animation pipelines.
  • Plug-and-Play Modularity: Seamlessly integrates pre-trained body and hand estimators through a lightweight modulator, achieving state-of-the-art results without expensive full-body retraining.




    Limitations of previous works

  • Hand-only Estimators: Recover isolated hands well but fail during interactions due to a lack of full-body context.
  • Whole-body Estimators: Capture global structure but lack hand accuracy because whole-body datasets have limited hand diversity.
  • Naïve Combination: Simply attaching hand outputs to the body leads to implausible wrist poses, especially under occlusion, because the hand estimates ignore the upper-body kinematic chain.


    Hand4Whole++

  • Efficient Learning under Limited Supervision: Primarily trained on hand-only datasets to capture diverse and challenging hand poses, despite the absence of full-body labels.
  • Preserving Pre-trained Expertise: Employs foundational whole-body and hand pose estimators, keeping them frozen during training to maintain their specialized capabilities and generalization.
  • Lightweight Optimization: Only the CHAM module is trained to modulate whole-body features with hand-centric cues, providing a highly efficient and practical "plug-and-play" solution.

    Architecture

  • Hand4Whole++ is a modular framework that bridges the supervision gap between whole-body and hand-only estimation without retraining foundational models.
  • Conditional Hands Modulator (CHAM): A lightweight, trainable module that refines whole-body features by injecting informative, hand-specific cues.
  • Frozen Pre-trained Estimators: Leverages specialized whole-body and hand pose estimators, keeping them frozen to preserve their original expertise.
  • Efficient Training: Only the CHAM module is trained, providing a practical, high-performance "plug-and-play" solution.
  • Decoupled Transfer: Hand-specific accuracy is incorporated through CHAM for wrists and upper-body poses, while finger details are transferred via rigid alignment.
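The decoupled transfer can be pictured at the parameter level with a minimal sketch. This is a simplification under stated assumptions, not the authors' implementation: the page describes the finger transfer as a rigid alignment of the hand mesh, whereas the sketch below only shows which branch owns which quantity. The function `decoupled_merge` and the dictionary layout are hypothetical; the key names are chosen to resemble common SMPL-X and MANO parameter names.

```python
def decoupled_merge(body_out, left_hand_out, right_hand_out):
    """Combine branch outputs: the body branch owns the wrists, the hand branches own the fingers.

    Hypothetical helper for illustration only.
    """
    # Wrist and upper-body pose stay exactly as the (CHAM-refined) body branch predicted them.
    merged = {key: list(value) for key, value in body_out.items()}
    # Finger articulation is taken from the hand-only branches.
    merged["left_hand_pose"] = list(left_hand_out["hand_pose"])
    merged["right_hand_pose"] = list(right_hand_out["hand_pose"])
    # The hand branches' own "global_orient" (their wrist guess) is intentionally ignored.
    return merged


# Toy values; the sizes assume axis-angle vectors (3 numbers per joint).
body_out = {
    "global_orient": [0.0, 0.0, 0.0],
    "body_pose": [0.1] * 63,        # 21 body joints x 3, wrists included
    "left_hand_pose": [0.0] * 45,   # 15 finger joints x 3
    "right_hand_pose": [0.0] * 45,
}
left_hand_out = {"global_orient": [1.0, 0.0, 0.0], "hand_pose": [0.2] * 45}
right_hand_out = {"global_orient": [0.0, 1.0, 0.0], "hand_pose": [0.3] * 45}

merged = decoupled_merge(body_out, left_hand_out, right_hand_out)
```

In this toy run the merged result keeps the body branch's `body_pose` and `global_orient` untouched, while both hand pose vectors come from the hand branches.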



    Conditional Hands Modulator (CHAM)

  • Hand-Specific Conditioning: Injects informative hand features into the whole-body stream to refine wrist orientation and upper-body kinematics.
  • Spatially Aligned Modulation: Uses inverse affine transformations and zero-initialized convolutions to maintain precise spatial alignment with the global body context.
  • Lightweight & Efficient: Optimized for speed, adding only 10 ms of latency while keeping pre-trained estimators frozen.
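The two ideas above, spatial alignment through an inverse affine transformation and zero-initialized convolutions, can be illustrated with a small pure-Python sketch. It is not the CHAM implementation: the function names, the nearest-cell placement, the single 1x1 convolution, and the additive update are all assumptions made for illustration. The point it shows is that with zero weights the whole-body features pass through unchanged, so training starts from the frozen estimator's behaviour.

```python
def invert_affine(m):
    """Invert a 2x3 affine matrix [[a, b, tx], [c, d, ty]]."""
    (a, b, tx), (c, d, ty) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)]]


def apply_affine(m, x, y):
    (a, b, tx), (c, d, ty) = m
    return a * x + b * y + tx, c * x + d * y + ty


def warp_to_body_grid(hand_feat, crop_affine, body_h, body_w):
    """Place hand-crop features on the body grid using the inverse of the crop affine."""
    channels = len(hand_feat[0][0])
    out = [[[0.0] * channels for _ in range(body_w)] for _ in range(body_h)]
    inv = invert_affine(crop_affine)  # hand-crop coordinates -> body-grid coordinates
    for v, row in enumerate(hand_feat):
        for u, vec in enumerate(row):
            x, y = apply_affine(inv, u, v)
            xi, yi = round(x), round(y)
            if 0 <= xi < body_w and 0 <= yi < body_h:
                out[yi][xi] = list(vec)
    return out


def conv1x1(feat, weight, bias):
    """Per-cell linear map over channels (a 1x1 convolution)."""
    return [[[sum(w * c for w, c in zip(w_row, vec)) + b
              for w_row, b in zip(weight, bias)]
             for vec in row] for row in feat]


def modulate(body_feat, hand_feat, crop_affine, weight, bias):
    """Add aligned, projected hand features to the body features."""
    h, w = len(body_feat), len(body_feat[0])
    aligned = warp_to_body_grid(hand_feat, crop_affine, h, w)
    delta = conv1x1(aligned, weight, bias)
    return [[[bf + d for bf, d in zip(bvec, dvec)]
             for bvec, dvec in zip(brow, drow)]
            for brow, drow in zip(body_feat, delta)]


# Toy data: a 4x4 body grid and a 2x2 hand crop, two channels each.
body_feat = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
hand_feat = [[[10.0, 20.0], [11.0, 21.0]], [[12.0, 22.0], [13.0, 23.0]]]
crop_affine = [[1, 0, -1], [0, 1, -2]]  # body (x, y) -> hand crop (x - 1, y - 2)

zero_weight = [[0.0, 0.0], [0.0, 0.0]]  # zero initialization
zero_bias = [0.0, 0.0]
at_init = modulate(body_feat, hand_feat, crop_affine, zero_weight, zero_bias)
```

Once the weights move away from zero, the hand features only affect the body-grid cells that the hand crop actually covers, which is what keeps the modulation spatially aligned with the body context.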



    Finger and shape transfer

  • High-Fidelity Integration: Leverages stable local finger poses from specialized estimators while discarding unstable global wrist predictions.
  • CHAM-Guided Orientation: Final wrist placement and orientation are determined by the body model, which is significantly refined by CHAM for global consistency.
  • Differentiable Rigid Alignment: Uses a differentiable transformation based on wrist and MCP joints to seamlessly align the detailed hand mesh to the body.
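One way to build a rigid transform from the wrist and MCP joints is sketched below. It is only an illustration under assumptions: the page does not say which MCP joints are used or how the transform is solved, so the sketch picks the index and pinky MCP joints and builds an orthonormal frame from them. All function names are hypothetical. Every step is made of subtractions, cross products, and normalizations, so the same construction would stay differentiable in an autodiff framework.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(a):
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]


def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]


def frame(wrist, index_mcp, pinky_mcp):
    """Orthonormal axes anchored at the wrist (assumed landmark choice)."""
    x = normalize(sub(index_mcp, wrist))
    z = normalize(cross(x, sub(pinky_mcp, wrist)))
    y = cross(z, x)
    return [x, y, z]


def rigid_from_landmarks(src, dst):
    """Rotation and translation that carry the source hand frame onto the target frame."""
    fs, fd = frame(*src), frame(*dst)
    # rot = sum_k fd[k] fs[k]^T maps each source axis onto the matching target axis.
    rot = [[sum(fd[k][i] * fs[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    trans = sub(dst[0], matvec(rot, src[0]))  # wrists coincide after alignment
    return rot, trans


def transform(rot, trans, points):
    return [add(matvec(rot, p), trans) for p in points]


# Toy check: landmarks of a detailed hand (wrist, index MCP, pinky MCP) and the
# same landmarks on the body mesh, related by a known rotation and translation.
hand_landmarks = [[0.0, 0.0, 0.0], [0.09, 0.02, 0.0], [0.08, -0.03, 0.01]]
true_rot = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
true_trans = [0.1, 0.2, 0.3]
body_landmarks = transform(true_rot, true_trans, hand_landmarks)

rot, trans = rigid_from_landmarks(hand_landmarks, body_landmarks)
fingertip = [0.17, 0.01, 0.02]
aligned_tip = transform(rot, trans, [fingertip])[0]
```

When the two landmark sets are not exactly rigidly related, this construction still matches the wrist position and the wrist-to-index direction exactly and fits the palm plane; a least-squares fit over more joints would be an alternative design.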



    CHAM ablation

  • Comparison with Fine-tuning: While directly fine-tuning a whole-body model on hand-centric datasets improves hand accuracy, it often causes the model to overfit, leading to distorted and implausible body poses.
  • Preserving Generalization: Hand4Whole++ maintains the original model's robust body reasoning while significantly boosting hand precision, because it modulates the features of the frozen backbone instead of updating its weights.
  • Anatomical Coherence: Unlike naïve fine-tuning, our CHAM-based approach ensures that hand enhancements are kinematically consistent with the entire upper-body structure.



    Citation

    Acknowledgements

    The website template is borrowed from ExAvatar.