PERSONA is our new framework that creates an animatable 3D human avatar from a single image.
Supports pose-driven deformations (i.e., non-rigid deformations of clothing).
High authenticity (preserves the identity of the input image).
Whole-body animation capability, including facial expressions and hand gestures.
Requires only a single image and is animatable with SMPL-X parameters.
Compared to previous 3D-based methods, our PERSONA better represents non-rigid deformations of clothes.
Compared to previous diffusion-based methods, our PERSONA better preserves the identity of the person.
High authenticity with pose-driven deformations
Existing 3D-based methods (e.g., LHM and AniGS) achieve better authenticity than diffusion-based methods, but lack pose-driven deformations of clothing.
Existing diffusion-based methods (e.g., MimicMotion and StableAnimator) achieve better pose-driven deformations of clothing than 3D-based methods, but lack authenticity.
Our PERSONA integrates the strengths of both approaches to achieve a personalized whole-body 3D avatar with pose-driven deformations.
Pose-rich training video generation
Generates diverse motion sequences (e.g., dance, rotation, light punches, kicks) from a single input image using MimicMotion.
Provides rich pose diversity to learn non-rigid clothing and body deformations that cannot be captured from the input image alone.
The input image and the generated videos together form our training set.
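The sketch below illustrates how such a training set could be assembled. It is a minimal sketch, not the actual pipeline: generate_video is a hypothetical stand-in for an image-to-video pose-animation model such as MimicMotion (whose real interface differs), and the motion list simply mirrors the examples above.

```python
# A minimal sketch, not the actual pipeline. `generate_video` is a
# hypothetical stand-in for an image-to-video pose-animation model
# such as MimicMotion; its real interface differs.
MOTIONS = ["dance", "rotation", "light_punches", "kicks"]


def build_training_set(input_image, generate_video):
    generated_frames = []
    for motion in MOTIONS:
        # Hypothetical call: animate the person in the image with the motion.
        generated_frames.extend(generate_video(input_image, motion))
    # The input image plus all generated frames form the training set.
    return {"input_image": input_image, "generated_frames": generated_frames}
```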
Balanced sampling
Alternates between the input image and generated video frames during training to oversample the input image.
Mitigates identity drift in diffusion-generated frames, preserving consistent facial features and clothing patterns.
Reduces baked-in artifacts by detecting seam boundaries via Sobel-filtered positional maps and supervising these regions only on generated frames (see the sketch below).
Illustration of our balanced sampling.
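A minimal sketch of both ideas, assuming NumPy/OpenCV; the function names, sampling ratio, and threshold are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch, assuming NumPy/OpenCV; names and values are hypothetical.
import random

import cv2
import numpy as np


def balanced_sampler(input_image, generated_frames):
    """Alternate between the single input image and diffusion-generated
    frames, so the input image is heavily oversampled relative to its
    share of the data and identity drift is mitigated."""
    while True:
        yield input_image, False                     # ground-truth identity
        yield random.choice(generated_frames), True  # pose-rich but noisier


def seam_mask_from_positional_map(pos_map, threshold=0.1):
    """Flag seam boundaries with Sobel filters on an (H, W, 3) positional
    map. These regions are supervised only on generated frames to avoid
    baking seam artifacts into the avatar. The threshold is illustrative."""
    grad = np.zeros(pos_map.shape[:2], dtype=np.float32)
    for c in range(3):
        gx = cv2.Sobel(pos_map[..., c], cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(pos_map[..., c], cv2.CV_32F, 0, 1, ksize=3)
        grad += np.abs(gx) + np.abs(gy)
    return grad > threshold  # boolean mask of high-gradient (seam) pixels
```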
Geometry-weighted optimization
Uses low weights for image loss and high weights for geometry loss (mask, depth, normal, part segmentation) when training the deformation model.
Geometry cues are more stable than images in noisy generated videos, so prioritizing geometry improves robustness (see the sketch below).
Illustration of our geometry-weighted optimization.
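A minimal PyTorch sketch of the weighting scheme; the weight values and exact loss terms below are illustrative assumptions, not the paper's configuration.

```python
# A minimal PyTorch sketch; weights and loss terms are illustrative
# assumptions, not the paper's configuration.
import torch.nn.functional as F

W_IMG = 0.1                                 # low weight on the image loss
W_MASK = W_DEPTH = W_NORMAL = W_PART = 1.0  # high weights on geometry losses


def deformation_loss(pred, target):
    """pred/target: dicts of rendered outputs vs. supervision signals."""
    loss_img = F.l1_loss(pred["rgb"], target["rgb"])
    loss_mask = F.binary_cross_entropy(pred["mask"], target["mask"])
    loss_depth = F.l1_loss(pred["depth"], target["depth"])
    loss_normal = 1.0 - F.cosine_similarity(
        pred["normal"], target["normal"], dim=-1).mean()
    loss_part = F.cross_entropy(pred["part_logits"], target["part_labels"])
    return (W_IMG * loss_img + W_MASK * loss_mask + W_DEPTH * loss_depth
            + W_NORMAL * loss_normal + W_PART * loss_part)
```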
Architecture
Uses MLPs to represent the pose-driven deformations (i.e., non-rigid deformations of clothing).
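A minimal PyTorch sketch of such a deformation MLP; the layer sizes and the 63-dimensional SMPL-X body-pose input are assumptions, not the paper's exact architecture.

```python
# A minimal PyTorch sketch; layer sizes and the 63-dimensional SMPL-X
# body-pose input (21 joints x 3 axis-angle values) are assumptions.
import torch
import torch.nn as nn


class PoseDrivenDeformation(nn.Module):
    """MLP that maps a canonical 3D point plus the SMPL-X pose to a
    non-rigid offset, modeling pose-driven clothing deformation."""

    def __init__(self, pose_dim=63, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-point xyz offset
        )

    def forward(self, points, pose):
        # points: (N, 3) canonical positions; pose: (pose_dim,) parameters
        pose = pose.expand(points.shape[0], -1)
        return points + self.mlp(torch.cat([points, pose], dim=-1))
```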