PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
ICCV 2025

What is PERSONA?

  • PERSONA is our new framework that creates an animatable 3D human avatar from a single image.
  • Supports pose-driven deformations (i.e., non-rigid deformations of clothing).
  • High authenticity (preserves the identity of the person in the input image).
  • Whole-body animation capability, including facial expressions and hand gestures.
  • Requires only a single image and is animatable with SMPL-X parameters.
  • Compared to previous 3D-based methods, our PERSONA better represents non-rigid deformations of clothes.
  • Compared to previous diffusion-based methods, our PERSONA better preserves the identity of the person.


    High authenticity with pose-driven deformations

  • Existing 3D-based methods (e.g., LHM and AniGS) achieve better authenticity than diffusion-based methods, but lack pose-driven deformations of clothing.
  • Existing diffusion-based methods (e.g., MimicMotion and StableAnimator) achieve better pose-driven deformations of clothing than 3D-based methods, but lack authenticity.
  • Our PERSONA integrates the strengths of both approaches to achieve a personalized whole-body 3D avatar with pose-driven deformations.


    Pose-rich training video generation

  • Generates diverse motion sequences (e.g., dance, rotation, light punches, kicks) from a single input image using MimicMotion.
  • Provides rich pose diversity to learn non-rigid clothing and body deformations that cannot be captured from the input image alone.
  • The input image and the generated videos together form our training set, as sketched below.
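
    A minimal sketch of how the training set could be assembled (paths and directory layout are hypothetical placeholders; the actual pipeline runs MimicMotion offline to produce the videos):

        import glob
        import cv2  # OpenCV, used here for frame extraction

        def build_training_set(input_image_path, video_dir):
            # Collect the single identity image plus every frame of the
            # MimicMotion-generated videos (hypothetical directory layout).
            frames = []
            for video_path in sorted(glob.glob(f"{video_dir}/*.mp4")):
                cap = cv2.VideoCapture(video_path)
                while True:
                    ok, frame = cap.read()
                    if not ok:
                        break
                    frames.append(frame)
                cap.release()
            input_image = cv2.imread(input_image_path)
            return input_image, frames  # consumed by the balanced sampler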


    Balanced sampling

  • Alternates between the input image and generated video frames during training to oversample the input image.
  • Mitigates identity drift in diffusion-generated frames, preserving consistent facial features and clothing patterns.
  • Reduces baked-in artifacts by detecting seam boundaries via Sobel-filtered positional maps and supervising these regions only on generated frames.
  • Illustration of our balanced sampling; see the code sketch below.
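
    A minimal PyTorch sketch of the two ingredients above; the 1:1 alternation ratio, tensor shapes, and gradient threshold are illustrative assumptions, not the paper's exact settings:

        import torch
        import torch.nn.functional as F

        def sobel_seam_mask(pos_map, thresh=0.1):
            # pos_map: (3, H, W) per-pixel 3D positions rendered from the avatar.
            # High positional gradients indicate likely seam boundaries.
            kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
            kernels = torch.stack([kx, kx.t()]).unsqueeze(1)            # (2, 1, 3, 3)
            grads = F.conv2d(pos_map.unsqueeze(1), kernels, padding=1)  # (3, 2, H, W)
            magnitude = grads.pow(2).sum(dim=(0, 1)).sqrt()             # (H, W)
            return (magnitude > thresh).float()  # 1 on likely seam boundaries

        def balanced_sampler(input_image, generated_frames, num_steps):
            # Alternate between the single input image and random generated
            # frames, oversampling the input image to anchor identity.
            for step in range(num_steps):
                if step % 2 == 0:
                    yield input_image, True   # input-image step
                else:
                    idx = torch.randint(len(generated_frames), (1,)).item()
                    yield generated_frames[idx], False  # generated-frame step

    During training, the seam mask would gate the supervision: pixels inside it are optimized only on generated-frame steps, which is what suppresses baked-in seam artifacts.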



    Geometry-weighted optimization

  • Uses low weights for image loss and high weights for geometry loss (mask, depth, normal, part segmentation) when training the deformation model.
  • Geometry cues are more stable than images in noisy generated videos, so prioritizing geometry improves robustness.
  • Illustration of our geometry-weighted optimization; see the code sketch below.
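
    A PyTorch sketch of the weighting idea; the specific weight values and loss forms (L1, cross-entropy) are illustrative assumptions:

        import torch
        import torch.nn.functional as F

        # Illustrative weights: geometry terms dominate the image term.
        W_IMG, W_MASK, W_DEPTH, W_NORMAL, W_PART = 0.1, 1.0, 1.0, 1.0, 1.0

        def geometry_weighted_loss(pred, target):
            # pred/target: dicts of rendered vs. supervised maps for one frame.
            loss = W_IMG * F.l1_loss(pred["rgb"], target["rgb"])              # image (low weight)
            loss = loss + W_MASK * F.l1_loss(pred["mask"], target["mask"])    # geometry cues
            loss = loss + W_DEPTH * F.l1_loss(pred["depth"], target["depth"])
            loss = loss + W_NORMAL * F.l1_loss(pred["normal"], target["normal"])
            loss = loss + W_PART * F.cross_entropy(pred["part_logits"], target["part"])
            return loss

    Down-weighting the image term means that even a noisy generated frame can still supply useful supervision through its more stable geometry maps.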



    Architecture

  • Uses MLPs to represent the pose-driven deformations (i.e., non-rigid deformations of clothing).
  • Illustration of our architecture; see the code sketch below.
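
    A minimal sketch of a pose-conditioned deformation MLP; the layer widths, pose encoding (flattened SMPL-X body pose), and offset output are our assumptions about the design:

        import torch
        import torch.nn as nn

        class PoseDeformMLP(nn.Module):
            # Maps a canonical 3D point plus SMPL-X pose parameters to a
            # per-point offset (non-rigid clothing deformation).
            def __init__(self, pose_dim=63, hidden=256):  # 63 = 21 body joints x 3 (axis-angle)
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 3),  # xyz offset
                )

            def forward(self, xyz, pose):
                # xyz: (N, 3) canonical points; pose: (pose_dim,) pose vector
                pose = pose.unsqueeze(0).expand(xyz.shape[0], -1)
                return self.net(torch.cat([xyz, pose], dim=-1))

    At animation time, such predicted offsets would be added to the canonical geometry before it is posed with the SMPL-X parameters.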



    Comparison to previous state-of-the-art avatars

  • We obtain SMPL-X parameters with SMPLest-X, Hand4Whole, and DECA.
  • Then, we drive our PERSONA with the obtained SMPL-X parameters (a rough driving sketch follows this list).
  • All the avatars are created from a single image.
  • Compared to previous 3D-based methods (LHM and AniGS), our PERSONA better represents non-rigid deformations of clothes.
  • Compared to previous diffusion-based methods (MimicMotion and StableAnimator), our PERSONA better preserves the identity of the person.
  • Compared to the previous state-of-the-art video-based method (ExAvatar), our PERSONA achieves comparable results with only a single image.
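
    A rough, hypothetical sketch of the driving loop; the .npz layout, key names, and the commented-out render call are placeholders we invented, not a released format or API:

        import numpy as np

        # Hypothetical layout of an offline SMPL-X fit (e.g., produced with
        # SMPLest-X, Hand4Whole, and DECA); all key names are assumptions.
        seq = np.load("driving_motion.npz")
        body_pose = seq["body_pose"]    # (T, 21, 3) axis-angle body pose
        hand_pose = seq["hand_pose"]    # (T, 30, 3) pose of both hands
        expression = seq["expression"]  # (T, 10) facial expression coefficients

        for t in range(body_pose.shape[0]):
            # render_avatar stands in for PERSONA's renderer: it would apply
            # the pose-driven deformation MLP, skin with SMPL-X, and rasterize.
            # frame = render_avatar(body_pose[t], hand_pose[t], expression[t])
            pass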


    Motion transfer from in-the-wild videos

  • We obtain SMPL-X parameters with SMPLest-X, Hand4Whole, and DECA.
  • Then, we drive our PERSONA with the obtained SMPL-X parameters.
  • All the avatars are created from a single image.


    Citation

    Acknowledgements

    The website template is borrowed from ExAvatar.