High-Precision Hands in Whole-Body Framework: A modular 3D whole-body pose estimation framework that significantly surpasses previous methods in hand accuracy.
Single-Image Inference: Efficiently recovers the full-body mesh and pose from a single RGB image, with no complex multi-view setup required.
SMPL-X Standard Output: Directly outputs expressive SMPL-X parameters, including body, hands, and face, ensuring compatibility with standard graphics and animation pipelines.
Plug-and-Play Modularity: Seamlessly integrates pre-trained body and hand estimators through a lightweight modulator, achieving state-of-the-art results without expensive full-body retraining.
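To make the SMPL-X output concrete, the sketch below lists the standard parameter blocks of the public SMPL-X model (axis-angle pose, 10 shape and 10 expression coefficients) that an expressive whole-body estimator produces; the dictionary name is illustrative, not from the framework's code.

```python
# Standard SMPL-X parameter blocks (axis-angle rotations per joint).
# Dimensions follow the public SMPL-X model; names are illustrative.
SMPLX_PARAM_DIMS = {
    "global_orient":   3,        # pelvis/root rotation
    "body_pose":       21 * 3,   # 21 body joints
    "left_hand_pose":  15 * 3,   # 15 joints per hand
    "right_hand_pose": 15 * 3,
    "jaw_pose":        3,
    "leye_pose":       3,
    "reye_pose":       3,
    "betas":           10,       # body shape coefficients
    "expression":      10,       # facial expression coefficients
}

# Total rotational parameters: 55 joints x 3 = 165.
pose_dims = sum(v for k, v in SMPLX_PARAM_DIMS.items()
                if "pose" in k or k == "global_orient")
```

Any pipeline that consumes SMPL-X (graphics, animation, physics) can read these blocks directly, which is what makes the output format interoperable.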
Limitations of previous works
Hand-only Estimators: Recover isolated hands well but fail during interactions due to a lack of full-body context.
Whole-body Estimators: Capture global structure but lack hand accuracy because whole-body datasets have limited hand diversity.
Naïve Combination: Simply attaching hand outputs to the body leads to implausible wrist poses, especially under occlusion, as they ignore the upper-body kinematic chain.
Hand4Whole++
Efficient Learning under Limited Supervision: Primarily trained on hand-only datasets to capture diverse and challenging hand poses, despite the absence of full-body labels.
Preserving Pre-trained Expertise: Employs foundational whole-body and hand pose estimators, keeping them frozen during training to maintain their specialized capabilities and generalization.
Lightweight Optimization: Only the CHAM module is trained to modulate whole-body features with hand-centric cues, providing a highly efficient and practical "plug-and-play" solution.
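The training scheme above can be sketched as follows. The estimators and CHAM are stubbed with plain linear layers purely for illustration (the real modules are pose-estimation networks; shapes and names here are assumptions, not the paper's code) — the point is that gradients flow only into the modulator.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the pre-trained estimators and CHAM;
# real modules are full pose networks, shapes here are arbitrary.
body_estimator = nn.Linear(256, 256)   # frozen whole-body backbone (stub)
hand_estimator = nn.Linear(256, 256)   # frozen hand backbone (stub)
cham = nn.Linear(512, 256)             # lightweight trainable modulator (stub)

# Freeze the pre-trained experts to preserve their capabilities.
for p in body_estimator.parameters():
    p.requires_grad = False
for p in hand_estimator.parameters():
    p.requires_grad = False

# Only CHAM's parameters are given to the optimizer.
optimizer = torch.optim.Adam(cham.parameters(), lr=1e-4)

x = torch.randn(2, 256)                          # stand-in image features
body_feat = body_estimator(x)
hand_feat = hand_estimator(x)
fused = cham(torch.cat([body_feat, hand_feat], dim=-1))

loss = fused.pow(2).mean()                       # placeholder loss
loss.backward()                                  # grads reach CHAM only
optimizer.step()
```

Because the frozen branches never receive gradients, hand-only datasets can supervise the modulator without degrading the whole-body model's generalization.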
Architecture
Hand4Whole++ is a modular framework that bridges the supervision gap between whole-body and hand-only estimation without retraining foundational models.
Conditional Hands Modulator (CHAM): A lightweight, trainable module that refines whole-body features by injecting informative, hand-specific cues.
Frozen pre-trained Estimators: Leverages specialized whole-body and hand pose estimators by keeping them frozen to preserve their original expertise.
Efficient Training: Only the CHAM module is trained, providing a practical, high-performance "plug-and-play" solution.
Decoupled Transfer: Hand-specific accuracy is incorporated through CHAM for wrists and upper-body poses, while finger details are transferred via rigid alignment.
Conditional Hands Modulator (CHAM)
Hand-Specific Conditioning: Injects informative hand features into the whole-body stream to refine wrist orientation and upper-body kinematics.
Spatially Aligned Modulation: Uses inverse affine transformations and zero-initialized convolutions to maintain precise spatial alignment with the global body context.
Lightweight & Efficient: Optimized for speed, adding only 10ms of latency while keeping pre-trained estimators frozen.
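A minimal sketch of the modulation idea, assuming a ControlNet-style zero-initialized convolution: hand-crop features (already warped back into the body feature grid by the inverse crop affine, assumed done upstream) enter through a 1x1 conv whose weights start at zero, so at initialization the frozen whole-body branch is left exactly unchanged. Class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class ZeroConvModulator(nn.Module):
    """Inject spatially aligned hand features into the whole-body
    feature map via a zero-initialized 1x1 conv (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero init: the residual path contributes nothing at the start,
        # preserving the frozen estimator's original behavior.
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, body_feat, hand_feat_aligned):
        # hand_feat_aligned: hand-crop features mapped back onto the
        # body feature grid by the inverse affine of the hand crop.
        return body_feat + self.zero_conv(hand_feat_aligned)

mod = ZeroConvModulator(64)
body = torch.randn(1, 64, 16, 16)
hand = torch.randn(1, 64, 16, 16)
out = mod(body, hand)   # equals body exactly at initialization
```

Zero initialization is what makes the module safe to bolt onto frozen backbones: training can only gradually move the output away from the pre-trained prediction.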
Finger and shape transfer
High-Fidelity Integration: Leverages stable local finger poses from specialized estimators while discarding unstable global wrist predictions.
CHAM-Guided Orientation: Final wrist placement and orientation are determined by the body branch, whose predictions CHAM refines for global consistency.
Differentiable Rigid Alignment: Uses a differentiable transformation based on wrist and MCP joints to seamlessly align the detailed hand mesh to the body.
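One standard way to realize such a rigid alignment is the Kabsch algorithm: estimate the rotation and translation that map the hand estimator's wrist and MCP joints onto the body model's corresponding joints, then apply that transform to the detailed hand mesh. The sketch below uses NumPy for clarity (the same SVD-based solve is differentiable in autograd frameworks); the toy joint coordinates are invented for illustration.

```python
import numpy as np

def rigid_align(src_joints, dst_joints):
    """Kabsch algorithm: find R, t mapping src points (e.g. hand-branch
    wrist + MCP joints) onto dst points (body-model wrist + MCPs)."""
    src_c = src_joints - src_joints.mean(axis=0)
    dst_c = dst_joints - dst_joints.mean(axis=0)
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_joints.mean(axis=0) - R @ src_joints.mean(axis=0)
    return R, t

# Toy example: wrist + 4 MCP joints (coordinates invented), displaced by
# a known rotation + translation that the solver should recover.
src = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.],
                [0., 1., 1.], [0.5, 0.5, 0.5]])
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.],
                   [np.sin(theta),  np.cos(theta), 0.],
                   [0.,             0.,            1.]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])

R, t = rigid_align(src, dst)
aligned = src @ R.T + t   # hand joints snapped onto the body's frame
```

Because the solve reduces to an SVD, gradients can flow through the alignment when it is implemented in a framework with SVD autograd, which is what makes the transfer step trainable end to end.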
CHAM ablation
Comparison with Fine-tuning: While directly fine-tuning a whole-body model on hand-centric datasets improves hand accuracy, it often causes the model to overfit, leading to distorted and implausible body poses.
Preserving Generalization: Hand4Whole++ maintains the original model's robust body reasoning while significantly boosting hand precision by modulating features through the frozen backbone.
Anatomical Coherence: Unlike naive fine-tuning, our CHAM-based approach ensures that hand enhancements are kinematically consistent with the entire upper-body structure.