IDOL: Instant Photorealistic 3D Human Creation

*Equal contributions | †Corresponding authors

Work done during the internship at Tencent by Yiyu Zhuang, Jiaxi Lv, and Hao Wen

Demo Video

Watch the demo video for a first look at IDOL in action: from a single input image, the method reconstructs a photorealistic, animatable 3D human.

Abstract

This work introduces IDOL, a feed-forward, single-image human reconstruction framework that is fast, high-fidelity, and generalizable. Trained on a large-scale dataset of 100K multi-view subjects, our method generalizes robustly across diverse human shapes, cross-domain data, extreme viewpoints, and occlusions. Because it uses a uniform structured representation, the reconstructed avatars are directly animatable and easily editable, a significant step forward for applications in graphics, vision, and beyond.

Our method (a) provides a fast and high-fidelity feed-forward single-image human reconstruction pipeline; (b) leverages a large-scale multi-view human dataset to handle diverse shapes, domains, viewpoints, and occlusions; (c) produces structured, animatable, and easily editable avatars.

Introduction

This introductory video presents the core ideas behind IDOL and gives an overview of how the framework achieves photorealistic single-image reconstruction, direct animation and editing, and robust performance in challenging scenarios.

Pipeline

The architecture of IDOL, a fully differentiable transformer-based framework for reconstructing an animatable 3D human from a single image. The model integrates a high-resolution encoder and fuses image tokens with learnable UV tokens through the UV-Alignment Transformer. A UV Decoder then predicts Gaussian attribute maps as intermediate representations, capturing the human's geometry and appearance in a structured 2D UV space defined by the SMPL-X model. Together with the SMPL-X model, these maps represent a 3D human avatar in a canonical space, which can be animated using linear blend skinning (LBS). The model is optimized on multi-view images with diverse poses and identities, learning to disentangle pose, appearance, and shape.
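To make the animation step concrete, here is a minimal sketch of linear blend skinning applied to canonical-space points such as Gaussian centers. It is an illustration under stated assumptions, not IDOL's implementation: the function names, the row-major 4x4 transform layout, and the per-point weight lists are hypothetical, and only positions are posed (a full Gaussian avatar would also need to transform each Gaussian's orientation).

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat4 = Sequence[Sequence[float]]  # assumed layout: 4x4, row-major, rigid transform


def transform_point(m: Mat4, p: Vec3) -> List[float]:
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = p
    return [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3)]


def linear_blend_skinning(
    points: Sequence[Vec3],
    weights: Sequence[Sequence[float]],
    joint_transforms: Sequence[Mat4],
) -> List[List[float]]:
    """Pose canonical points: p' = sum_j w_j * (T_j applied to p).

    points           -- N canonical-space positions
    weights          -- N rows of per-joint skinning weights, each summing to 1
    joint_transforms -- one canonical-to-posed transform per joint
    """
    posed = []
    for p, w in zip(points, weights):
        if len(w) != len(joint_transforms):
            raise ValueError("one weight per joint is required")
        if abs(sum(w) - 1.0) > 1e-6:
            raise ValueError("skinning weights must sum to 1")
        acc = [0.0, 0.0, 0.0]
        for w_j, m in zip(w, joint_transforms):
            if w_j == 0.0:
                continue  # joints with zero weight do not move this point
            q = transform_point(m, p)
            for k in range(3):
                acc[k] += w_j * q[k]
        posed.append(acc)
    return posed
```

For example, a point weighted equally between a joint that stays fixed and a joint translated by 2 units along y moves by 1 unit along y, which is the blending behavior the name refers to.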

Dataset

HuGe100K

We introduce a large-scale multi-view human dataset containing over 100K photorealistic subjects for noncommercial research purposes. The dataset supports robust 3D reconstruction tasks and facilitates research advancements in human modeling.

License and Attribution

This dataset includes images derived from the DeepFashion dataset, originally provided by MMLAB at The Chinese University of Hong Kong. The use of DeepFashion images in this dataset has been authorized by the original authors solely for the purpose of creating and distributing this dataset. Users must not further reproduce, distribute, sell, or commercially exploit any images or derived data originating from DeepFashion. For any subsequent or separate use of the DeepFashion data, users must directly obtain authorization from MMLAB and comply with the original DeepFashion License.

Apply for Dataset Access

To access the dataset, please submit an application using the link below: [Dataset Access]

Animation Results

Given a reference image that provides the target character's identity and a reference video that supplies the pose, our method animates the reconstructed human to match the movements observed in the video.
Using the uniform 3D representation and SMPL-X pose parameters, IDOL achieves precise control over the avatar's pose, including detailed finger movements.
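As background for pose control, the standard SMPL-X parameterization stores each joint's rotation as an axis-angle vector, which has to be converted to a rotation matrix before it can drive the skeleton. The sketch below shows that conversion with Rodrigues' formula. It is an illustrative helper with a hypothetical name, not code from IDOL.

```python
import math
from typing import List, Sequence


def axis_angle_to_matrix(rotvec: Sequence[float]) -> List[List[float]]:
    """Convert an axis-angle vector (axis * angle in radians) to a 3x3 rotation matrix."""
    rx, ry, rz = rotvec
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-8:
        # Near-zero rotation: return the identity to avoid dividing by zero.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = rx / theta, ry / theta, rz / theta  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    # Rodrigues' formula: R = cos(theta) I + sin(theta) [k]_x + (1 - cos(theta)) k k^T
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]
```

Applying this per joint and per frame of the reference video gives the joint rotations from which posed joint transforms are composed; finger articulation uses the same conversion on the hand joints.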

Citation

If you find our work useful, please cite it using the following format:

    @misc{zhuang2024idolinstantphotorealistic3d,
        title={IDOL: Instant Photorealistic 3D Human Creation from a Single Image},
        author={Yiyu Zhuang and Jiaxi Lv and Hao Wen and Qing Shuai and Ailing Zeng and Hao Zhu and Shifeng Chen and Yujiu Yang and Xun Cao and Wei Liu},
        year={2024},
        eprint={2412.14963},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.14963},
    }