IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis

ACM Transactions on Graphics (SIGGRAPH Asia 2022)

Jingxiang Sun¹, Xuan Wang², Yichun Shi³, Lizhen Wang¹, Jue Wang², Yebin Liu¹

¹Tsinghua University ²Tencent AI Lab ³ByteDance Inc.

Abstract

Existing 3D-aware facial generation methods face a dilemma in quality ver- sus editability: they either generate editable results in low resolution, or high quality ones with no editing flexibility. In this work, we propose a new approach that brings the best of both worlds together. Our system consists of three major components: (1) a 3D-semantics-aware generative model that produces view-consistent, disentangled face images and semantic masks; (2) a hybrid GAN inversion approach that initialize the latent codes from the semantic and texture encoder, and further optimized them for faithful reconstruction; and (3) a canonical editor that enables efficient manipulation of semantic masks in canonical view and producs high quality editing re- sults. Our approach is competent for many applications, e.g. free-view face drawing, editing and style control. Both quantitative and qualitative results show that our method reaches the state-of-the-art in terms of photorealism, faithfulness and efficiency.

Overview

Pipeline of our 3D generator and encoders. The 3D generator (upper) consists of several parts. First, a StyleGAN feature generator G_feat constructs the spatially aligned 3D volumes of semantic and texture in an efficient tri-plane representation. To decouple different facial attributes, shape and texture codes are injected separately into both the shallow and the deep layers of G_feat. Moreover, the deep layers are designed to three parallel branch corresponding to each feature plane to reduce the entanglement among them. Given the generated 3D volumes, RGB images and semantic masks can be rendered jointly via the volume rendering and a 2D CNN-based up-sampler. Encoders (lower) embeds the portrait images and corresponding semantic masks into the texture and semantic latent codes by two independent proposed encoders. With a predicted camera pose, Then the fixed generator reconstructs the portrait under the predicted camera pose. In order to eliminate pose effect, we jointly train a canonical editor which takes as input the portrait images and semantic masks under the canonical view, with the consistency enforcement.

Citation

@article{sun2022ide,
 title = {IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis},
 author = {Sun, Jingxiang and Wang, Xuan and Shi, Yichun and Wang, Lizhen and Wang, Jue and Liu, Yebin},
 journal = {ACM Transactions on Graphics (TOG)},
 volume = {41},
 number = {6},
 articleno = {270},
 pages = {1--10},
 year = {2022},
 publisher = {ACM New York, NY, USA},
 doi={10.1145/3550454.3555506},
}

Acknowledgement

This paper is sponsed by NSFC No. 62125107.