Do as I Do: Pose Guided Human Motion Copy (2024)

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, and Kui Ren

Sifan Wu is with the School of Computer Science and Technology, Jilin University, Changchun 130015, China (e-mail: wusifan2021@gmail.com). Zhenguang Liu (corresponding author), Zhongjie Ba, and Kui Ren are professors at the School of Cyber Science and Technology, Zhejiang University, Hangzhou 310058, China (e-mail: liuzhenguang2008@gmail.com, {zhongjieba,kuiren}@zju.edu.cn). Beibei Zhang is with Zhejiang Lab, Hangzhou, Zhejiang Province 311121, China (e-mail: bzeecs@gmail.com). Roger Zimmermann is a professor at the School of Computing, National University of Singapore, 119613, Singapore (e-mail: rogerz@comp.nus.edu.sg). Xiaosong Zhang is with the Center for Cyber Security, College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China (e-mail: johnsonzxs@uestc.edu.cn).

Abstract

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers.

Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning, which helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches, gaining 7.2% and 12.4% improvements in PSNR and FID, respectively.

Index Terms:

Motion copy, deep fake, Gromov-Wasserstein, fake video.

1 Introduction

The seismic breakthrough of artificial intelligence has given rise to numerous intriguing and appealing video applications. A compelling application is to copy the motion from a source person to a target person, generating a fake video of the target person enacting the same motion as the source person. Motion copy empowers an untrained person to be depicted in videos dancing like a professional dancer, acting like a Kung Fu star, and playing basketball like an NBA player. Correspondingly, motion copy finds its applications in a wide spectrum of scenarios including animation production [1, 2], augmented reality [3, 4], and social media entertainment [5]. Interestingly, the source and target persons might be greatly different in body shape, appearance, and race.

Fundamentally, motion copy amounts to learning a mapping from the given video of a source person to the target video of a target person, as shown in Fig. 1. The task is inherently challenging due to the high dimensionality of the mapping and the subtle motion details to be generated. Technically, each frame of the target fake video comprises millions of pixels. Even a few wrong pixels are highly noticeable to human observers.

Generally, motion copy is carried out in two steps. In the first step, the pose or mesh sequence of the source person is extracted from the source video. In the second step, motion copy learns a generative model that maps the intermediate representation (pose or mesh sequence) to the appearance of the target person, synthesizing the fake video where the target person enacts the motion of the source person. One line of works extracts human poses as the intermediate representation, and is referred to as pose-guided methods [6, 7, 8]. Another line of works captures human body meshes as the intermediate representation, and is termed warping-guided methods [9, 10]. Recently, a few approaches advocate transferring motion directly in the image feature space [11] or introduce neural rendering techniques to reconstruct human templates from static images [12]. In this paper, we focus on pose-guided target video generation, in view of its efficiency and robustness to cloth deformation.

[Figure 1]

Upon investigating and experimenting with the released implementations of state-of-the-art methods [6, 13, 7, 9, 10], we empirically observe the following issues: (1) Current pose-to-appearance generation models primarily hinge on either an L1 or L2 loss to train a GAN that bridges the gap between a pose and its target appearance. Such GANs necessitate a large number of training samples. Nevertheless, we often have only one or a few videos of the target person for training. (2) Whereas existing methods achieve plausible results in broad strokes, the issues of distorted faces, hands, and feet are quite rampant. The high-fidelity textures of the face, hands, and feet, which either require sufficient details or undergo flexible movements, are usually missing. (3) Most existing methods generate each frame independently, ignoring the fact that adjacent frames are closely related to each other. This usually leads to temporal inconsistency in the generated video.

In this paper, we embrace three key designs to tackle the challenges. (1) We augment our pose-to-appearance GAN with a theoretically motivated Gromov-Wasserstein loss and a perceptual loss, which alleviates the problem of scarce training samples and attains realistic results. (2) We propose an episodic memory module in the pose-to-appearance generation so that the model continuously accumulates experience from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We instill spatial coherency and temporal consistency into our generated video by designing a spatial-temporal discriminator.

Interestingly, most existing methods typically focus on fake human motion generation. In this paper, we explore applying our approach to a range of objects including humans, fish, and mice. Extensive experiments are conducted on benchmark datasets including iPER, ComplexMotion, SoloDance, Fish, and Mouse datasets. Empirically, our approach outperforms state-of-the-art approaches by a large margin (7.2% and 12.4% gain in PSNR and FID) in fake video generation.

To summarize, the key contributions of this work are:

  • We investigate a novel framework that incorporates the Gromov-Wasserstein loss and perceptual loss for pose-to-appearance generation, which encodes pairwise distance constraints and attains realistic results.

  • In light of the divide-and-conquer strategy [14], we polish the local regions of key body parts including face, hands, and feet separately with dedicated local GANs. We empirically present a new vector field incorporating ears to characterize the face orientation, which serves to identify frames with similar face orientations to enhance the generated face.

  • Extensive experiments show that our approach achieves state-of-the-art performance. Besides, our approach can be generalized to other articulated objects, including fish and mice.

We would like to share that this paper is a continuation of our earlier work “Copy Motion From One to Another: Fake Motion Video Generation” published in IJCAI 2022 [55], which was accepted as a Long Presentation paper at an acceptance rate of 3.75% (the paper acceptance rate is 15%, and Long Presentation papers are those that rank in the top 25% among the accepted papers). This work is distinct from the conference version in four aspects. (1) Unlike our earlier work, which generates each frame independently in a sequence-to-frame framework, this work generates $k$ consecutive foreground frames simultaneously with a sequence-to-sequence framework, encoding the wealth of temporal context information. (2) In this paper, we propose a novel episodic memory component that stores the poor generations of the model and replays these samples occasionally to enforce the model to continuously learn from its defects. (3) To capture the orientation of the human face, in contrast to the mouth vector employed in our earlier work, we experimentally discover that the geometric information from the ear vectors on the face is more stable and significant. Inspired by this, we present a new vector field to characterize the face orientation. (4) This work consistently outperforms the earlier work on the iPER and ComplexMotion datasets and provides more insights and findings on human motion copy. Significantly, our earlier work focuses only on human motion. In this paper, we explore applying our approach to a range of objects including humans, fish, and mice.

The remainder of the paper is organized as follows. In Section 2, we give a brief introduction to the related work on image synthesis and human motion copy. Thereafter, we elaborate on the proposed method in Section 3. In Section 4, we present the experiments and performance analysis. Finally, we conclude the paper in Section 5.

2 Related Work

Before diving into the details of our approach, let us first review and categorize the related works on motion copy. We first recap the holistic view of image synthesis, which provides a broader range of research pertinent to human motion copy. We then present existing human motion copy approaches, which can be cast into three categories, namely pose-guided human motion copy, warping-guided human motion copy, and no-intermediary human motion copy.

2.1 Image Synthesis

Earlier research resorts to Variational Autoencoders [15] and Auto-Regressive models [16] for image synthesis. Recently, the proposal and application [17, 18] of Generative Adversarial Networks (GANs) [19] have led to great advancement in image generation. Technically, GANs utilize a generator-discriminator architecture, where the generator produces images and the discriminator distinguishes between real and fake images. The generator and discriminator are iteratively optimized in a two-player min-max game. Conditional GANs synthesize images under a given conditional input (e.g., class labels). Isola et al. [20] consider the conditional GAN as a general solution to image synthesis tasks such as image reconstruction, style transfer, and image coloring. Rather than generating a vanilla image, [21, 18] propose a two-stage GAN to produce a high-resolution image. Toward photo-realistic image generation, Gao et al. [22] propose a lightweight network structure that contains one generator and two discriminators to generate two images of different sizes in a feed-forward process. GANs have made remarkable progress in recent years on many tasks [23, 24, 25, 26]. However, it is well known that GANs are difficult to train and the training process is usually unstable. Toward an easy-to-train and stable GAN, Arjovsky et al. [27, 28] propose WGAN, which introduces a novel Wasserstein loss. The Wasserstein metric has a superior smoothing property compared to the KL divergence used in standard GANs, which can theoretically alleviate the gradient vanishing problem. A drawback of these methods lies in requiring a considerable amount of samples to train a model, which might limit their applications in human motion copy, where we may not have a large number of training samples available.

2.2 Human Motion Copy

Existing approaches for human motion copy can be roughly categorized into three groups, namely pose-guided, warping-guided, and no-intermediary methods.

Pose-Guided Human Motion Copy. [29] is the first seminal work on human motion copy, which proposes a two-stage coarse-to-fine generation. Since then, a great deal of research has been conducted on human motion copy. Pumarola et al. [30, 31, 32] employ generators and discriminators to reconstruct the target person image with arbitrary poses. Esser et al. [33, 13] propose a unique conditional U-Net, which regulates the output of a variational auto-encoder on appearance. However, these approaches rely heavily on large-scale training samples, which is difficult to fulfill in practical applications. Ren et al. [34, 35] achieve great image quality with posture augmentation and novel image refinement. Ghafoor et al. [8] propose a video-to-video action transfer framework, which consists of a cascaded sequence of action transfer blocks with a multi-resolution structure similarity loss. Yang et al. [36] perform human video motion transfer in an unsupervised manner, which utilizes the invariance of three orthogonal variation factors, including motion, structure, and view. Nonetheless, these methods fail to take into account the importance of maintaining facial details during the transfer of human motion. Although Chan et al. [6] introduce a face enhancement module, due to the overfitting problem of the GAN, it is not effective in generating satisfactory faces. In contrast, our body part enhancement polishes the generated face with a self-supervised training scheme and refines the key body portions using dedicated local GANs.

Warping-Guided Human Motion Copy. Dong et al. [37, 9, 38] disentangle the human image into action and appearance, and then perform motion imitation with a warping GAN that distorts the image according to reference poses. Similarly, Shysheya et al. [39, 40] introduce an attention mechanism between the pose skeleton and the image to generate UV coordinates, and then warp patch-level human texture maps to fit the UV coordinates. However, these methods are limited by the diversity of texture maps, resulting in blurs and artifacts in the generated video. Han et al. [41] focus on learning an appearance flow that warps the clothing of a target person to the corresponding area of the source person. Wei et al. [10] warp the motion of the target human image and then refine the details. Nevertheless, warping-based motion copy methods, by nature, have difficulties in coping with rapid human motion. Moreover, these methods disregard the temporal consistency across frames, resulting in discontinuous videos and visual artifacts.

No-Intermediary Human Motion Copy. There are also attempts that direct their efforts at motion copy without any intermediaries (i.e., poses or meshes). Joo et al. [11] employ two specific losses to constrain a GAN that generates a fusion image (one person's identity with another's shape). However, this work concentrates on upper-body motion copy (without legs and feet) and eye style transfer. In contrast, our model not only achieves whole-body human motion copy but also boldly attempts motion copy between animals. To the best of our knowledge, we are the first to replicate movements between articulated objects of the same species, including fish and mice.

[Figure 2]

3 Our Method

Problem Formulation. Broadly, given two videos, one video for the target person whose appearance we would like to synthesize and the other video for the source person whose actions we would like to copy[6], we are interested in generating a fake video of the target person performing the same actions as the source person.

Method Overview. An overview of our method FakeVideo is outlined in Fig. 2. Overall, FakeVideo consists of four key components: (1) The pose extraction module draws out the human poses from the video of the source person, where the poses serve as motion copy intermediaries. The foreground and background separation module segments the video of the target person into a foreground (i.e., human body) sequence and a background sequence. (2) The pose-to-appearance GAN generates an appearance sequence for the target person from the extracted pose sequence. The local enhancement module is further engaged to polish the local regions of key body parts (face, hands, and feet). (3) The episodic memory component stores the poor generations of the model and replays these samples occasionally to enforce the model to continuously learn from its own defects. (4) The foreground and background fusion module generates a fake video by fusing the polished foreground sequence and the background sequence. We would like to highlight that our generator has an edge in adopting Gromov-Wasserstein and perceptual losses while being equipped with a memory component. Meanwhile, our discriminator imposes spatial and temporal dual constraints, driving the generator toward better generations. In what follows, we elaborate on the four key components in detail.

3.1 Pose Extraction and Foreground-Background Separation

Pose Extraction. The goal of motion copy is to learn a mapping between a given video of the source person and the target video of the target person. Unfortunately, each frame of the two videos has millions of pixels, making it extremely difficult to acquire the mapping directly. Inspired by the rapid development of pose estimation techniques[42, 43, 44], we utilize pose skeleton sequence as the intermediary for motion copy. The pose sequence unambiguously indicates the motions and can be used to guide body appearance generation. To this end, we shift to learn a mapping from the poses to the body appearance sequence. Particularly, we adopt pre-trained pose detectors OpenPose[45] and DCPose [46] to extract poses from videos.
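
To make the pose intermediary concrete, below is a minimal sketch that rasterizes a set of detected 2D keypoints into a skeleton image suitable for conditioning the generator; the limb connectivity list and the (x, y, confidence) keypoint layout are illustrative assumptions rather than the exact OpenPose/DCPose output format.

```python
import numpy as np
import cv2

# Hypothetical limb connectivity (indices into a COCO-style keypoint layout);
# the actual skeleton definition depends on the pose detector's output format.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def render_pose_map(keypoints, height, width, conf_thresh=0.1):
    """Rasterize 2D keypoints (N x 3 array of x, y, confidence) into an RGB
    skeleton image that conditions the pose-to-appearance generator."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca < conf_thresh or cb < conf_thresh:
            continue  # skip limbs whose endpoints were not reliably detected
        cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (0, 255, 0), 3)
    for x, y, c in keypoints:
        if c >= conf_thresh:
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 0, 255), -1)
    return canvas
```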

[Figure 3]

Foreground and Background Separation. The pose skeleton clearly characterizes the motion; however, we believe it is too ambitious to synthesize a full frame (foreground and background) directly conditioned on a desired pose. Instead, an important step of our pipeline is to compute a mask matrix $M$, which is leveraged to explicitly disentangle each video frame into foreground and background. We devise a generator to concentrate on synthesizing only the foreground sequence from poses. This allows our model to avoid considering a large number of background pixels in the pose-to-appearance generation, resulting in a more realistic appearance of the human and faster convergence of the network. Specifically, we adopt the off-the-shelf Mask-RCNN [47] to obtain the mask matrix $M$. In addition, we employ image inpainting technology [48] to fill the removed foreground pixels in the background.
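
As a rough illustration of this pre-processing step, the following sketch extracts a binary person mask with torchvision's off-the-shelf Mask R-CNN; the weight identifier and score threshold are assumptions, not the paper's exact settings.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Load an off-the-shelf Mask R-CNN; weight names may differ across torchvision versions.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def person_mask(frame_rgb, score_thresh=0.7):
    """Return a binary H x W foreground mask M for the most confident 'person'
    detection (COCO label 1) in an RGB uint8 frame."""
    pred = model([to_tensor(frame_rgb)])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] > score_thresh)
    if not keep.any():
        return torch.zeros(frame_rgb.shape[:2], dtype=torch.bool)
    best = pred["scores"].masked_fill(~keep, -1).argmax()
    return pred["masks"][best, 0] > 0.5  # soft instance mask -> binary mask
```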

3.2 Pose-to-appearance Generation and Local Enhancement

Now, we consider how to generate an appealing body foreground sequence upon a given pose sequence. Technically, we design a pose-to-appearance generation GAN (appearance GAN), consisting of a generator that incorporates perceptual loss and Gromov-Wasserstein loss, and a discriminator that exerts spatio-temporal dual constraints.

Dense Skip Connections in Generator. The structure of the generator is illustrated in Fig. 3, where we employ a U-shaped architecture with multiple encoder-decoder layers. In a conventional U-Net, a decoder layer solely connects to its symmetric encoder layer [49, 50]. These relatively isolated relationships between encoder-decoder layers at different levels lead to insufficient spatial information modeling in the encoding and decoding process. Explicitly, during the encoding process of the conventional U-Net architecture, consecutive convolutions in the encoder inevitably drop some low-level detailed features. To tackle this challenge, we devise dense skip connections in the U-shaped architecture. Our motivation is to preserve rich features from multiple levels rather than using only one level of features in the foreground generation. Therefore, as shown in Fig. 3, instead of connecting a decoder at layer $i$ with only the symmetric encoder at layer $i$, we add extra skip connections from the encoders at layers $\{1,2,\cdots,i-1\}$ to the decoder at layer $i$. For example, decoder layer De-layer4 not only receives the feature information from the skip connection of encoder layer En-layer4 (as in the conventional U-Net), but also receives the feature information from encoder layers {En-layer1, En-layer2, En-layer3}. In this way, each decoder can integrate multi-level latent features and is able to access lower-level features.
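
A minimal PyTorch sketch of the dense-skip idea follows; the layer widths, downsampling scheme, and activation choices are illustrative and not the paper's exact generator configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSkipUNet(nn.Module):
    """Minimal U-shaped generator where decoder level i receives features from
    encoder levels 1..i (dense skip connections), not only its mirror level."""
    def __init__(self, in_ch=3, out_ch=3, widths=(64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = w
        self.decoders = nn.ModuleList()
        for d in reversed(range(len(widths) - 1)):
            skip_ch = sum(widths[:d + 1])   # dense skips from encoder levels 1..d+1
            up_ch = widths[d + 1]           # channels coming from the deeper level
            self.decoders.append(nn.Sequential(
                nn.Conv2d(up_ch + skip_ch, widths[d], 3, padding=1), nn.ReLU(inplace=True)))
        self.head = nn.Conv2d(widths[0], out_ch, 3, padding=1)

    def forward(self, x):
        feats = []
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)
        y = feats[-1]                       # bottleneck features
        for d, dec in zip(reversed(range(len(feats) - 1)), self.decoders):
            y = F.interpolate(y, size=feats[d].shape[-2:], mode="nearest")
            # resize the mirror and every shallower encoder map, then concatenate
            skips = [F.interpolate(f, size=y.shape[-2:], mode="nearest") for f in feats[:d + 1]]
            y = dec(torch.cat([y] + skips, dim=1))
        y = F.interpolate(y, scale_factor=2, mode="nearest")
        return torch.tanh(self.head(y))
```

The key point is that each decoder level concatenates resized feature maps from every encoder level at or below its own depth, rather than only from its mirror level.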

Gromov-Wasserstein Loss and Perceptual Loss to Facilitate Appearance Generation. In the training phase, we extract the pose sequence and foreground sequence from the video of the target person, and train our generator network to capture the mapping from the pose sequence to the corresponding foreground sequence of the target person. Existing methods typically address this pose-to-appearance problem with a conventional GAN and measure the discrepancy between the generated foreground frame and the ground truth frame via a pixel-wise L2 loss. Such approaches, by nature, require a large number of training samples to reach convergence. To alleviate this issue, we propose a Gromov-Wasserstein loss that preserves the distance structure of the feature space instead of the conventional pixel-wise L2 loss. Particularly, the Gromov-Wasserstein loss enforces that the generated fake frames have the same feature distance structure as their corresponding ground truth frames. Put differently, if two ground truth frames $F_i$ and $F_j$ are close to each other in the image feature space, the generated fake frames for them should also be close to each other. Conversely, if $F_i$ and $F_j$ are far apart in the image feature space, the generated fake frames for them should also be far apart. In this way, we are able to train the network in a pairwise manner, where the training samples are multiplied. Besides the Gromov-Wasserstein loss, we also add a perceptual loss that further forces the generated frame to be consistent with the ground truth frame.

Formally, given a pose sequence $\langle P_1, P_2, \cdots, P_m\rangle$, the pose-to-appearance generation network synthesizes a foreground sequence $\langle \overline{F}_1, \overline{F}_2, \cdots, \overline{F}_m\rangle$. Specifically, we denote the feature tensors of $\langle \overline{F}_1, \overline{F}_2, \cdots, \overline{F}_m\rangle$ as $\langle \hat{\mathcal{F}}_1, \hat{\mathcal{F}}_2, \cdots, \hat{\mathcal{F}}_m\rangle$, and the feature tensors of the corresponding ground truth sequence $\langle F_1, F_2, \cdots, F_m\rangle$ as $\langle \mathcal{F}_1, \mathcal{F}_2, \cdots, \mathcal{F}_m\rangle$. Mathematically,

$$\{\hat{\mathcal{F}}_k\}_{k=1}^{m}=\Phi(\{\overline{F}_k\}_{k=1}^{m}),\quad \{\mathcal{F}_k\}_{k=1}^{m}=\Phi(\{F_k\}_{k=1}^{m}) \qquad (1)$$

where $\Phi(\cdot)$ represents a pre-trained feature extraction backbone network. Heuristically, we show in Fig. 3 that optimizing the Gromov-Wasserstein loss amounts to aligning the two groups of feature tensors so that the generated fake images preserve the distance structure of their corresponding ground truth images. We could view $\{\hat{\mathcal{F}}_k\}_{k=1}^{m}$ and $\{\mathcal{F}_k\}_{k=1}^{m}$ as discrete empirical distributions $\mu$ and $\tau$, which are given by:

$$\mu=\sum_{k=1}^{m}\frac{1}{m}\delta_{\hat{\mathcal{F}}_k},\quad \tau=\sum_{k=1}^{m}\frac{1}{m}\delta_{\mathcal{F}_k} \qquad (2)$$

where $\delta_{(\cdot)}$ represents the Dirac delta distribution. Then, the Gromov-Wasserstein loss for our model can be formulated as:

$$\mathcal{L}_{GW(\mu,\tau)}=\min_{\pi\in\Pi}\sum_{i,j,k,l}\left|\left\|\hat{\mathcal{F}}_i-\hat{\mathcal{F}}_k\right\|_1-\left\|\mathcal{F}_j-\mathcal{F}_l\right\|_1\right|^{2}\pi_{ij}\pi_{kl} \qquad (3)$$

where $\Pi$ denotes the set of joint distributions (couplings) with margins $\mu$ and $\tau$. The optimal transport matrix $\pi$ can be calculated by minimizing the squared discrepancy between the intra-space L1 costs.

Inspired by [51, 52], an entropy regularization term is introduced to ensure tractability and reversible backpropagation in the optimal transport loss optimization. In addition, we utilize the Sinkhorn algorithm and the projected gradient descent method [51] to solve the entropy-regularized Gromov-Wasserstein loss. Technically, the process of optimizing the Gromov-Wasserstein loss is outlined in Algorithm 1.

Algorithm 1: Optimizing the entropy-regularized Gromov-Wasserstein loss

1: Input: (i) generated feature tensors $\{\hat{\mathcal{F}}_k\}_{k=1}^{m}=\Phi(\{\overline{F}_k\}_{k=1}^{m})$ and (ii) ground truth feature tensors $\{\mathcal{F}_k\}_{k=1}^{m}=\Phi(\{F_k\}_{k=1}^{m})$
2: Output: Gromov-Wasserstein distance $GW_{\lambda}$
3: Hyperparameters: $\lambda>0$, projection iterations P, Sinkhorn iterations S
4: Initialize: $\pi_{kl}^{(0)}=\frac{1}{n},\ \forall k,l$; $m=j-i$
5: Cost matrix for generated feature tensors: $D_{ij}=\mathcal{L}_{1}(\hat{\mathcal{F}}_i,\hat{\mathcal{F}}_j)$
6: Cost matrix for ground truth feature tensors: $E_{ij}=\mathcal{L}_{1}(\mathcal{F}_i,\mathcal{F}_j)$
7: for t = 1:P do
8:   $C=\frac{1}{m}E^{2}\mathbb{1}_{m}\mathbb{1}_{m}^{T}+\frac{1}{m}\mathbb{1}_{m}\mathbb{1}_{m}^{T}D^{2}-2E\pi^{(t-1)}D^{T}$
9:   $K=e^{-C/\lambda}$
10:  $b^{(0)}=\mathbb{1}_{m}$
11:  for l = 1:S do
12:    $a^{(l)}=\mathbb{1}_{m}\oslash Kb^{(l-1)}$
13:    $b^{(l)}=\mathbb{1}_{m}\oslash K^{T}a^{(l)}$  # $\oslash$ denotes component-wise division
14:  end for
15:  $\pi^{(t)}=\mathrm{diag}(a^{(S)})\,K\,\mathrm{diag}(b^{(S)})$
16: end for
17: $GW_{\lambda}=\sum_{i,j,k,l}\left\|E_{ik}-D_{jl}\right\|^{2}\pi_{ij}^{(P)}\pi_{kl}^{(P)}$
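
For concreteness, here is a compact PyTorch sketch of the entropy-regularized Gromov-Wasserstein loss computed with Sinkhorn scaling over two sets of m flattened frame features; the hyperparameter values and the use of torch.cdist for the L1 cost matrices are assumptions for illustration.

```python
import torch

def gromov_wasserstein_loss(gen_feats, gt_feats, lam=0.1, proj_iters=5, sinkhorn_iters=20):
    """Entropy-regularized Gromov-Wasserstein distance between two sets of m
    feature vectors (m x d matrices), following a projected Sinkhorn scheme."""
    m = gen_feats.shape[0]
    ones = torch.ones(m, device=gen_feats.device)
    # Intra-space L1 cost matrices (Algorithm 1, lines 5-6).
    D = torch.cdist(gen_feats, gen_feats, p=1)
    E = torch.cdist(gt_feats, gt_feats, p=1)
    pi = torch.full((m, m), 1.0 / (m * m), device=gen_feats.device)
    for _ in range(proj_iters):
        # Gradient-like cost of the quadratic GW objective at the current coupling.
        C = (E ** 2 @ torch.outer(ones, ones) + torch.outer(ones, ones) @ D ** 2) / m \
            - 2.0 * E @ pi @ D.T
        K = torch.exp(-C / lam)
        a, b = ones.clone(), ones.clone()
        for _ in range(sinkhorn_iters):        # Sinkhorn scaling (lines 11-14)
            a = ones / (K @ b)
            b = ones / (K.T @ a)
        pi = torch.diag(a) @ K @ torch.diag(b)
    # Final GW objective: sum_{i,j,k,l} |E_ik - D_jl|^2 pi_ij pi_kl.
    # m is the number of frames in a short clip, so the m^4 sum stays affordable.
    gw = ((E.unsqueeze(1).unsqueeze(3) - D.unsqueeze(0).unsqueeze(2)) ** 2
          * pi.unsqueeze(2).unsqueeze(3) * pi.unsqueeze(0).unsqueeze(1)).sum()
    return gw
```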

Perceptual Loss.  While the Gromov-Wasserstein loss facilitates the appearance generation in the presence of sparse training samples, another loss is introduced into the network to better maintain image reconstruction details. An intuitive approach is to utilize the mean squared error (MSE) loss to minimize the pixel-wise loss between the generated human foreground $\overline{F}$ and the ground truth $F$:

$$\mathcal{L}_{MSE}=\left\|F-\overline{F}\right\|^{2}_{2}, \qquad (4)$$

where $\left\|\cdot\right\|_{2}$ represents the L2 norm. Nevertheless, the MSE loss may produce blurry and distorted images or lead to ill-posed details [53]. Given this context, we adopt a perceptual reconstruction loss that constrains the generated $\overline{F}$ to approach the ground truth in the feature space:

$$\mathcal{L}_{p}=\left\|\Psi(F)-\Psi(\overline{F})\right\|^{2}_{2}, \qquad (5)$$

where $\Psi(\cdot)$ represents a feature extraction network. The pixel-wise loss concentrates too much on the brightness of each pixel, whereas the feature-level loss places more emphasis on spatial consistency. Collectively, the Gromov-Wasserstein loss and the perceptual loss together facilitate appearance generation.
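
As an illustration, a frozen VGG-19 feature extractor can instantiate $\Psi(\cdot)$ as in Eq. (5); the backbone choice and layer cut-off below are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Feature-space reconstruction loss: compares generated and ground-truth
    foregrounds through a frozen VGG-19 backbone (backbone and layer cut-off
    are illustrative assumptions)."""
    def __init__(self, layer_index=16):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="DEFAULT").features[:layer_index]
        for p in vgg.parameters():
            p.requires_grad_(False)  # Psi(.) is a fixed, pre-trained extractor
        self.backbone = vgg.eval()

    def forward(self, fake, real):
        # Eq. (5): || Psi(F) - Psi(F_bar) ||_2^2 (averaged over elements here)
        return torch.mean((self.backbone(fake) - self.backbone(real)) ** 2)
```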

Discriminator in Pose-to-Appearance Generation.  (1) Previous approaches for motion copy typically employ a spatial discriminator that concentrates on the quality of each frame and fails to explicitly consider video continuity. (2) When we watch videos, we tend to care about both the quality of individual frames and the continuity across frames. We believe it is crucial to jointly take into account spatial consistency and temporal continuity. Based on the two observations above, we present a spatial-temporal dual constraint, consisting of a quality discriminator $D_q$ and a temporal discriminator $D_t$. Specifically, (1) the quality discriminator $D_q$ enforces the forged foreground image to approach the ground truth; (2) the temporal discriminator $D_t$ captures the temporal information across frames using a set of parallel dilated convolutions. $D_q$ takes $(P_i, F_i)$ or $(P_i, \overline{F}_i)$ as input, while $D_t$ absorbs $(P_{t-1}^{t+1}, F_{t-1}^{t+1})$ or $(P_{t-1}^{t+1}, \overline{F}_{t-1}^{t+1})$. Note that $P_i$ stands for the pose of the $i^{th}$ frame and $\overline{F}_i$ denotes the generated foreground for $P_i$. $P_{t-1}^{t+1}$ and $\overline{F}_{t-1}^{t+1}$ represent $\langle P_{t-1}, P_t, P_{t+1}\rangle$ and $\langle \overline{F}_{t-1}, \overline{F}_t, \overline{F}_{t+1}\rangle$, respectively. Both discriminators are trained to output binary labels, real or fake. Overall, the generator strives to create more lifelike videos to fool the dual discriminator, while the discriminator tries its best to distinguish between the generated video and the ground truth. Model performance is iteratively optimized in a two-player min-max game fashion.
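
The following sketch shows one plausible shape for the temporal discriminator $D_t$ with parallel dilated convolutions over a 3-frame window; the channel counts, dilation rates, and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Sketch of a temporal discriminator D_t: it stacks a 3-frame pose/foreground
    window along channels and applies parallel dilated convolutions to judge whether
    the short clip is real or generated (channel sizes and dilations are assumptions)."""
    def __init__(self, in_ch=(3 + 3) * 3, base=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, base, 3, stride=2, padding=d, dilation=d),
                nn.LeakyReLU(0.2, inplace=True))
            for d in dilations
        ])
        self.head = nn.Sequential(
            nn.Conv2d(base * len(dilations), base, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, 1, 3, padding=1))  # patch-level real/fake logits

    def forward(self, poses, frames):
        # poses, frames: (B, 3, 3, H, W) -> flatten the 3-frame window into channels
        x = torch.cat([poses.flatten(1, 2), frames.flatten(1, 2)], dim=1)
        feats = [branch(x) for branch in self.branches]
        return self.head(torch.cat(feats, dim=1))
```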

The image quality often appears imperfect when the fine-grained local details are missing. We scrutinized and implemented state-of-the-art methods following their released code and parameter settings [6, 13, 7, 9, 10]. A significant insight we gain from the experiments is that current methods still have difficulties in generating detailed face, natural hands, and clear feet. After obtaining the initial body appearance using the proposed pose-to-appearance GAN network, we further employ a self-supervised face enhancement component and multiple local GANs to polish the details of local parts.

[Figure 4]

Self Supervised Face Enhancement with Vector Field. Intuitively, face images of the same person with similar face orientations should look similar to each other. Therefore, we search the given videos of the target person to identify face images whose face orientations are similar to that of the synthesized image. In particular, we choose multiple images with the closest face orientations rather than using only the single closest one, making the enhancement more robust to noise.

For the measurement of face orientation similarity, an intuitive approach is to compute the similarity between facial features. However, facial features usually convey too much information irrelevant to face orientation, e.g., color and eye shape. To tackle the problem, a viable method, as shown in Fig. 4, is to represent the face orientation with a face vector field. As shown in Fig. 4, we employ six vectors: $v_1$: right eye $\rightarrow$ left eye, $v_2$: left eye $\rightarrow$ nose, $v_3$: right eye $\rightarrow$ nose, $v_4$: right ear $\rightarrow$ left ear, $v_5$: nose $\rightarrow$ right ear, $v_6$: nose $\rightarrow$ left ear. Given two face orientations $\{v_i\}_{i=1}^{6}$ and $\{\hat{v}_i\}_{i=1}^{6}$, their similarity can be conveniently computed as:

$$\mathcal{S}=\frac{1}{\sum_{i=1}^{6}\|\hat{v}_i-v_i\|_{2}} \qquad (6)$$

Subsequently, we choose the top $m$ real facial images $\mathbf{f}=\{f_1, f_2, \dots, f_m\}$ with the largest similarities $\mathcal{S}_m$ as auxiliary faces. Finally, the generated face $f$ is enhanced into:

$$f^{\prime}=\alpha\sum_{i=1}^{m}\left(\frac{S_i}{\sum_{j=1}^{m}S_j}\times f_i\right)+\beta f \qquad (7)$$

where $\frac{S_i}{\sum_{j=1}^{m}S_j}$ measures the weight of the $i^{th}$ chosen face $f_i$, and $\alpha$ and $\beta$ are hyperparameters. The process is also depicted in Fig. 5.
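
A small numpy sketch of the vector-field similarity (Eq. 6) and the weighted face enhancement (Eq. 7) is given below; the keypoint ordering and the values of $\alpha$, $\beta$, and the numerical epsilon are assumptions.

```python
import numpy as np

# Index pairs defining the six orientation vectors of Eq. (6); the keypoint order
# (nose, left/right eye, left/right ear) is an assumption about the detector output.
NOSE, L_EYE, R_EYE, L_EAR, R_EAR = 0, 1, 2, 3, 4
VECTOR_PAIRS = [(R_EYE, L_EYE), (L_EYE, NOSE), (R_EYE, NOSE),
                (R_EAR, L_EAR), (NOSE, R_EAR), (NOSE, L_EAR)]

def face_vectors(kpts):
    """kpts: (5, 2) array of facial keypoints -> (6, 2) array of orientation vectors."""
    return np.stack([kpts[b] - kpts[a] for a, b in VECTOR_PAIRS])

def orientation_similarity(kpts_a, kpts_b, eps=1e-6):
    """Eq. (6): reciprocal of the summed L2 distance between the two vector fields."""
    diff = face_vectors(kpts_a) - face_vectors(kpts_b)
    return 1.0 / (np.linalg.norm(diff, axis=1).sum() + eps)

def enhance_face(gen_face, real_faces, sims, alpha=0.5, beta=0.5):
    """Eq. (7): blend the m most similar real faces (weighted by similarity)
    with the generated face; alpha and beta are illustrative hyperparameters."""
    weights = np.asarray(sims) / np.sum(sims)
    blended = sum(w * face for w, face in zip(weights, real_faces))
    return alpha * blended + beta * gen_face
```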

[Figure 5]

Multi-Local GANs. After enhancing the face, we further refine the face and limbs using multiple local GANs. In light of the divide-and-conquer strategy [14], we design multi-local GANs to refine the key parts separately. Concretely, we clip the five key parts $\overline{F}^{i}$ (face, two hands, and two feet) from the generated foreground image $\overline{F}$ and feed them into the corresponding dedicated local GANs. Each local GAN outputs a residual image $\hat{F}_r^{i}$, which captures the difference between the generated body part and the ground truth (in terms of color and texture). These residual images are added to $\overline{F}$ (the original foreground generation) to produce the final foreground:

$$\widetilde{F}^{i}=\hat{F}_r^{i}+\overline{F}^{i} \qquad (8)$$

3.3 Episodic Memory for Experience Replay

For the pose-to-appearance generation, we adopt the theoretically grounded Gromov-Wasserstein loss to mitigate the issue of insufficient training samples. Furthermore, inspired by lifelong learning, we introduce an episodic memory component for appearance generation, which propels continuous learning and the accumulation of past knowledge over a lifetime. More specifically, we store previous poor generations in the episodic memory and replay them periodically during training. This enforces the network to consistently learn from its own mistakes and accumulate experience. Interestingly, the mechanism is similar to the human brain, which occasionally recaps significant moments recorded in memory. The entire procedure of memory replay is formulated in Algorithm 2. We may describe the high-level idea as follows:

Algorithm 2: Training with episodic memory replay, and inference

1: Training
2: Input: training samples $\langle P_t, F_t\rangle_{t=1}^{T}$, replay time interval K
3: # $P_i$ stands for the desired pose of the target person in the $i^{th}$ frame, and $F_i$ represents the corresponding appearance
4: Output: generation model G
5: for epoch = 1:N do
6:   if epoch mod K = 0 then
7:     Sample m examples from M
8:     # M represents the memory
9:     Calculate the Gromov-Wasserstein loss and perceptual loss, and then perform backpropagation to update the parameters of G
10:    # Experience Replay
11:  end if
12:  for t = 1:T do
13:    Retrieve training samples $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$
14:    Calculate the Gromov-Wasserstein loss and perceptual loss, and then perform backpropagation to update the parameters of G
15:    if store memory then
16:      Write $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$ to memory M
17:    end if
18:    if perceptual_loss > loss_threshold then
19:      Write the poor generation examples $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$ into memory M
20:    end if
21:  end for
22: end for
23: Return G
24: Inference
25: Input: the poses $P_{t=1}^{T}$ of the source person $\mathbb{S}$, the generation model G
26: Output: the foreground $\overline{F}_{t=1}^{T}$
27: # the generated foreground (appearance) $\overline{F}$ for the target person $\mathbb{T}$
28: for t in range(1:T:3) do
29:   $\overline{F}_{t}^{t+2}$ = G($P_{t}^{t+2}$)
30: end for
31: return $\overline{F}_{t=1}^{T}$

In the first epoch, we utilize all training samples to train the pose-to-appearance generation network (Algorithm 2, lines 12-14). We then select all the poorly generated samples (those whose perceptual loss exceeds a threshold) together with a few other randomly selected samples, and put them into the episodic memory (Algorithm 2, lines 15-19). In the following epochs, we continue training on all the training samples. During this subsequent training, every K epochs we randomly select several samples from the episodic memory and replay (retrain on) them to update the parameters of the pose-to-appearance generation network (Algorithm 2, lines 6-11).

Memory replay would keep the model from catastrophic forgetting, continuously improving the generated frames by learning from past poor generations. However, overfitting problems may arise if the training samples in the memory are revisited too frequently. Following [54], our memory replay is designed to be executed only occasionally.
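
A minimal sketch of such a memory-and-replay mechanism is shown below; the class names, capacity, thresholds, and eviction policy are illustrative choices rather than the paper's implementation.

```python
import random
import torch

class EpisodicMemory:
    """Store pose/frame windows whose perceptual loss was high, then replay a
    small batch of them occasionally (capacity, batch size and the random
    eviction policy are illustrative choices)."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.buffer = []

    def maybe_store(self, poses, frames, perceptual_loss, loss_threshold=0.2):
        if perceptual_loss.item() > loss_threshold:     # a "poor generation"
            if len(self.buffer) >= self.capacity:
                self.buffer.pop(random.randrange(len(self.buffer)))
            self.buffer.append((poses.detach().cpu(), frames.detach().cpu()))

    def replay(self, generator, loss_fn, optimizer, batch_size=4, device="cuda"):
        """Retrain the generator on a few remembered poor generations."""
        if not self.buffer:
            return
        for poses, frames in random.sample(self.buffer, min(batch_size, len(self.buffer))):
            poses, frames = poses.to(device), frames.to(device)
            loss = loss_fn(generator(poses), frames)    # GW + perceptual losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

In training, maybe_store would be invoked per mini-batch while replay runs only once every K epochs, mirroring lines 15-19 and 6-11 of Algorithm 2.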

3.4 Foreground and Background Fusion

Up to this point, we have obtained the polished foreground $\widetilde{F}$. In the pre-processing phase, we have computed the mask matrix $M$ of the foreground in the image and have refilled the removed foreground pixels in the background $B$ following [48]. We utilize a linear sum to couple the foreground $\widetilde{F}$ and the background $B$:

$$\tilde{I}=M\odot\widetilde{F}+(1-M)\odot B \qquad (9)$$
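
Assuming the mask, foreground, and background are spatially aligned tensors, the fusion of Eq. (9) reduces to a few lines:

```python
import torch

def fuse_foreground_background(mask, foreground, background):
    """Eq. (9): blend the polished foreground into the inpainted background.
    mask: (1, H, W) binary/soft foreground mask M; foreground, background: (3, H, W)."""
    return mask * foreground + (1.0 - mask) * background
```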

3.5 Loss Functions

For the pose-to-appearance generator, we utilize the Gromov-Wasserstein and perceptual losses. Now, we zoom in on the loss functions of the discriminator. We introduce a standard adversarial loss, where the quality discriminator $D_q$ attempts to discern the real and generated frames:

$$\mathcal{L}_{q}=\mathbb{E}_{(P,F)}[\log D_{q}(P,F)]+\mathbb{E}_{P}[\log(1-D_{q}(P,\overline{F}))] \qquad (10)$$

We additionally propose a temporal consistency loss to ensure the temporal smoothness of the generated video:

$$\mathcal{L}_{t}=\mathbb{E}_{(P,F)}[\log D_{t}(P_{t-1}^{t+1},F_{t-1}^{t+1})]+\mathbb{E}_{P}[\log(1-D_{t}(P_{t-1}^{t+1},\overline{F}_{t-1}^{t+1}))] \qquad (11)$$

where $D_t$ is a temporal discriminator which tries to distinguish the real frame sequence $F_{t-1}^{t+1}$ from the fake sequence $\overline{F}_{t-1}^{t+1}$.
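The two objectives in Eqs. (10) and (11) can be written compactly with binary cross-entropy, as is conventional for this type of adversarial loss. The sketch below is illustrative: `D_q` and `D_t` are placeholders for the quality and temporal discriminators, and they are assumed to output logits rather than probabilities.

```python
import torch
import torch.nn.functional as F


def quality_d_loss(D_q, pose, real_frame, fake_frame):
    """Eq. (10): D_q separates real from generated frames, conditioned on the pose."""
    real_logit = D_q(pose, real_frame)
    fake_logit = D_q(pose, fake_frame.detach())
    real_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    return real_loss + fake_loss


def temporal_d_loss(D_t, pose_seq, real_seq, fake_seq):
    """Eq. (11): D_t judges short sequences (frames t-1..t+1) for temporal consistency."""
    real_logit = D_t(pose_seq, real_seq)
    fake_logit = D_t(pose_seq, fake_seq.detach())
    real_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    return real_loss + fake_loss
```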

[Figure 6]
[Figure 7]

4 Experiments

In this section, we first present the experimental settings and the details of the five benchmark datasets: iPER [9], ComplexMotion [55], SoloDance [10], Fish [56], and Mouse [57]. Then, we introduce the evaluation metrics for motion copy and compare our method with state-of-the-art approaches. Further, we investigate the effects of the different components of our framework. Finally, we adapt our method to other articulated objects, including fish and mice. Briefly, we seek to answer the following research questions.

  • RQ1: How is the proposed method compared to state-of-the-art methods on human motion copy?

  • RQ2: Is our method able to synthesize motion videos with attractive details for a target person?

  • RQ3: How much do different components of our method contribute to the performance?

  • RQ4: How well does the proposed method generalize to animals such as fish and mice?

Next, we introduce the experimental settings and empirically investigate the research questions one by one.

4.1 Experimental Settings

4.1.1 Datasets

Experiments are conducted on five benchmark datasets, iPER [9], ComplexMotion [55], SoloDance [10], Fish dataset [56], and Mouse dataset [57].

iPER. For human motion copy, we experiment on the iPER [9] dataset, which contains 30 persons with different shapes, heights, and genders. A person may wear different outfits, and there are 103 outfits in total. The dataset contains 241,564 frames from 206 videos. The videos cover different actions, including arm exercise, stretching exercise, standing and reaching, leaping, swimming, taichi, chest mobility exercise, leg stretching, squat, and leg-raising.

ComplexMotion. We also conduct experiments on the ComplexMotion [55] dataset, which contains rapid and complex motions of more than 50 persons. The videos are collected from various video platforms such as Tiktok (https://www.tiktok.com) and Youtube (https://www.youtube.com). In particular, ComplexMotion consists of 68,320 frames from 122 videos. Within the videos, persons wear various clothes and perform complex movements such as street dance, sports, and kung fu.

SoloDance. We further conduct experiments on the SoloDance [10] dataset, which contains 179 dance videos with 53,700 frames. Specifically, 143 human subjects were captured, each wearing different clothes and performing complex dances (e.g., modern and street dances) in various scenes.

Fish dataset. For motion copy from one fish to another, we utilize the Fish dataset [56], which contains 14 fish videos of 6 different fish. Each video consists of 2,250 to 24,000 frames.

Mouse dataset. For mouse motion copy, we use the Mouse dataset [57], which includes 12 mouse videos of 4 mice. The mouse depth images were captured at 25 FPS with a top-view Primesense Carmine camera. The 3D poses of the mouse are extracted from the depth images using the annotation tool of [57]. Then, we project the 3D poses onto the 2D plane to obtain the 2D poses of the mice. The number of frames in each video varies from 500 to 30,000.
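The paper does not specify the camera model used for this projection, so the sketch below assumes a simple orthographic drop of the depth axis; a calibrated perspective projection would replace it if camera intrinsics are available.

```python
import numpy as np

def project_to_2d(joints_3d: np.ndarray) -> np.ndarray:
    """Orthographic projection: drop the depth axis of (J, 3) joints -> (J, 2).

    This is an illustrative assumption; with known intrinsics, a perspective
    projection (x * f / z + cx, y * f / z + cy) would be used instead.
    """
    return joints_3d[:, :2]
```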

4.1.2 Implementation details

We utilize PyTorch 1.4.0 to implement our proposed framework. We train the FakeVideo framework independently on iPER and ComplexMotion. During training, all frames are resized to 512 × 512. We utilize OpenPose to detect 18 human joints in each frame of the dataset. For the Fish and Mouse datasets, we would like to point out that we do not further enhance the local details of the generated fish and mice, since they are small in body size and the pose-to-appearance generation seems to be sufficient to yield realistic fish and mice. We employ Mask-RCNN to disentangle the foreground (body) sequence and the background sequence from a video. We utilize the pre-trained VGGNet [58] as our frame feature extractor, which consists of 16 convolutional layers and 3 fully connected layers. The output of the 16th convolutional layer is the extracted feature, which is used for the perceptual loss. We train our model for 120 epochs on a server with NVIDIA GeForce RTX 2080 Ti GPUs.
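As a concrete reference for the feature extractor described above, the sketch below taps deep features from torchvision's pre-trained VGG-19 (which has 16 convolutional and 3 fully connected layers) and compares them with an L1 distance. The exact cut-off index and the L1 formulation are assumptions rather than the authors' code, and the `weights=` argument requires a recent torchvision (older versions use `pretrained=True`).

```python
import torch
import torch.nn.functional as F
import torchvision

# Keep everything up to the last conv/ReLU block of VGG-19; the slice index is
# an assumption and may differ from the exact layer used in the paper.
_vgg_features = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)


def perceptual_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between deep VGG features of generated and ground-truth frames."""
    return F.l1_loss(_vgg_features(generated), _vgg_features(target))
```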

TABLE I: Comparison with state-of-the-art methods on ComplexMotion and iPER. SSIM↑, PSNR↑, and LPIPS↓ evaluate Image Reconstruction; FID↓, IS↑, and TCM↑ evaluate Motion Imitation.

ComplexMotion:
| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | TCM↑ |
| EDN [6] | 0.823 | 24.36 | 0.061 | 64.12 | 3.411 | 0.534 |
| FSV2V [7] | 0.748 | 22.51 | 0.132 | 99.11 | 3.164 | 0.575 |
| PoseWarp [13] | 0.711 | 21.42 | 0.149 | 78.21 | 3.109 | 0.334 |
| LWGAN [9] | 0.789 | 24.27 | 0.081 | 85.30 | 3.398 | 0.683 |
| C2F-FWN [10] | 0.878 | 25.68 | 0.048 | 53.19 | 3.408 | 0.689 |
| FakeMotion [55] | 0.883 | 27.15 | 0.040 | 48.03 | 3.543 | 0.773 |
| FakeVideo (Ours) | 0.896 | 27.52 | 0.032 | 46.62 | 3.728 | 0.813 |

iPER:
| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | TCM↑ |
| EDN [6] | 0.852 | 24.48 | 0.086 | 57.52 | 3.305 | 0.591 |
| FSV2V [7] | 0.824 | 21.18 | 0.108 | 107.29 | 3.136 | 0.754 |
| PoseWarp [13] | 0.792 | 22.16 | 0.119 | 115.23 | 3.095 | 0.601 |
| LWGAN [9] | 0.843 | 22.32 | 0.091 | 76.38 | 3.258 | 0.729 |
| C2F-FWN [10] | 0.847 | 24.32 | 0.074 | 60.12 | 3.412 | 0.769 |
| FakeMotion [55] | 0.856 | 25.86 | 0.068 | 56.27 | 3.461 | 0.799 |
| FakeVideo (Ours) | 0.868 | 26.72 | 0.049 | 54.94 | 3.582 | 0.872 |

TABLE II: Ablation study. The first five metric columns are on ComplexMotion, the last five on iPER (SSIM↑, PSNR↑, LPIPS↓ for Image Reconstruction; FID↓, IS↑ for Motion Imitation).

| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ |
| Our method, Complete | 0.896 | 27.52 | 0.032 | 46.62 | 3.728 | 0.868 | 26.72 | 0.049 | 54.94 | 3.582 |
Ablation study on Dense Skip Connections
| r/m dense skip connections | 0.868 | 25.28 | 0.050 | 62.39 | 3.218 | 0.813 | 24.21 | 0.075 | 62.12 | 3.271 |
Ablation study on Self-Supervised Face Enhancement
| Image feature | 0.703 | 22.42 | 0.129 | 83.44 | 3.215 | 0.634 | 22.14 | 0.108 | 99.34 | 3.019 |
| 2 face vectors | 0.728 | 24.68 | 0.129 | 78.20 | 3.108 | 0.719 | 21.34 | 0.114 | 89.51 | 3.167 |
| 3 face vectors | 0.758 | 25.62 | 0.079 | 56.80 | 3.331 | 0.807 | 22.56 | 0.089 | 73.33 | 3.267 |
| 4 face vectors | 0.784 | 26.08 | 0.088 | 59.71 | 3.304 | 0.753 | 21.52 | 0.098 | 75.28 | 3.261 |
| 5 face vectors | 0.883 | 27.15 | 0.040 | 48.03 | 3.543 | 0.856 | 25.86 | 0.068 | 56.27 | 3.461 |
| 1 candidate face | 0.732 | 24.88 | 0.139 | 75.20 | 3.158 | 0.689 | 20.24 | 0.104 | 88.51 | 3.067 |
| 2 candidate faces | 0.793 | 26.22 | 0.083 | 60.91 | 3.371 | 0.746 | 22.72 | 0.088 | 75.54 | 3.321 |
Ablation study on Multiple Local GANs
| r/m multi-local GAN | 0.872 | 26.19 | 0.053 | 61.49 | 3.320 | 0.848 | 24.19 | 0.078 | 63.22 | 3.373 |

4.2 Comparison with State-of-the-art Methods on Multiple Metrics (RQ1)

In this section, we compare our proposed approach with existing state-of-the-art approaches, which include:

  • EDN (Everybody dance now) [6]: A well-known pose-guided method for human motion copy, which makes amateurs dance like ballerinas.

  • C2F-FWN (Coarse-to-fine flow warping network) [10]: A novel motion copy method, which warps the layout based on the transformation flow.

  • FakeMotion [55]: A motion copy approach, which generates human appearance with optimal transport theory and polishes the local body parts with multiple local GANs.

  • FSV2V (Few-shot video2video) [7]: A high-resolution and few-shot video generation method which is applicable to motion copy, facial expression transformation, etc.

  • LWGAN (Liquid warping GAN) [9]: A unified warping framework which implements human motion copy, appearance (clothes) transfer, and novel view generation.

  • PoseWarp [13]: A motion copy method for sport scenes. In the method, 3D poses rather than 2D poses are utilized as the motion intermediary, which provide the spatial characteristics of a motion.

To quantitatively compare our method with existing approaches, we divide the applications into two scenarios: Image Reconstruction and Motion Imitation. For Image Reconstruction, we perform self-mimicry experiments in which persons imitate actions from themselves. In other words, we feed the pose skeleton of a subject into the network and output the human image of the same subject. We adopt Structural Similarity (SSIM) [59] as a low-level metric, and Peak Signal-to-Noise Ratio (PSNR) and Learned Perceptual Image Patch Similarity (LPIPS) [60] as perceptual-level metrics to evaluate the quality of the generated image sequence. For Motion Imitation, we perform cross-mimicry, where persons imitate the movements of others. Put differently, we input the pose skeleton of a subject into the network and output the human image of another subject. We utilize the Inception Score (IS) [61] and Fréchet Inception Distance (FID) [62] to examine the differences between the generated images and the ground-truth images. In addition, following [10], we employ the Temporal Consistency Metric (TCM) [63] to measure the temporal continuity of the generated video.

The experimental results on the ComplexMotion and iPER datasets are summarized in Table I. From the table, we have the following observations. (1) Among the existing methods, C2F-FWN [10] and FakeMotion [55] achieve the current state-of-the-art performance on both datasets. (2) FakeVideo outperforms state-of-the-art approaches in both Image Reconstruction and Motion Imitation. For example, in image reconstruction and motion imitation, FakeVideo gains 7.2% and 12.4% improvements on the PSNR and FID metrics, respectively. The significant performance improvements suggest the potential of FakeVideo for motion copy.

The experimental results on the SoloDance dataset are summarized in Table III. From the table, we have the following observations. (1) Among the existing methods, C2F-FWN [10] achieves the current state-of-the-art performance on SoloDance. (2) FakeVideo again outperforms state-of-the-art approaches, gaining 4.4% and 4% improvements on the PSNR and FID metrics, respectively.

TABLE III: Comparison with state-of-the-art methods on the SoloDance dataset.

| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | TCM↑ |
| EDN [6] | 0.811 | 23.22 | 0.051 | 53.17 | 0.347 |
| FSV2V [7] | 0.721 | 20.84 | 0.132 | 112.99 | 0.106 |
| PoseWarp [13] | 0.692 | 19.80 | 0.147 | 120.13 | 0.102 |
| LWGAN [9] | 0.786 | 20.87 | 0.106 | 86.53 | 0.176 |
| C2F-FWN [10] | 0.879 | 26.65 | 0.049 | 46.49 | 0.641 |
| FakeVideo (Ours) | 0.893 | 27.82 | 0.038 | 44.72 | 0.739 |
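For reference, PSNR and SSIM on individual frames can be computed with scikit-image as sketched below; LPIPS, FID, IS, and TCM are usually obtained from dedicated packages and are omitted here. The `channel_axis` argument assumes scikit-image 0.19 or newer (older versions use `multichannel=True`).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def frame_metrics(generated: np.ndarray, target: np.ndarray) -> dict:
    """PSNR and SSIM for a single pair of uint8 RGB frames of identical size."""
    return {
        "psnr": peak_signal_noise_ratio(target, generated, data_range=255),
        "ssim": structural_similarity(target, generated, channel_axis=-1, data_range=255),
    }
```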

4.3 Visual Comparison with State-of-the-art Methods (RQ2)

Further, we visualize the generated results of state-of-the-art approaches on three datasets, as depicted in Fig. 6. Empirically, PoseWarp [13] and FSV2V [7] may produce distorted body shapes and missing limbs. We conjecture that this is because they do not fuse information across multiple scales, leading to inevitable information loss in the generation process. EDN [6] achieves realistic visual results; however, its generated human faces usually have blurred facial parts. LWGAN [9] and C2F-FWN [10] can effectively copy motions according to the optical flow; however, they have difficulties in generating fine-grained clothes and hair. In contrast, our method yields a more realistic human body and plausible local details. More visual results of our method are demonstrated in Fig. 7. We see that our method consistently generates realistic frames.

As shown in Fig. 6, the first column (Target Person) illustrates the target person, the second column (Source Person) shows the source person, the third column presents the desired poses, which are obtained from videos of the source person (not the target person), and the remaining columns show the generated frames of the target person. As shown in Fig. 6 and Fig. 7, we would like to clarify that the source person and the target person are not the same individual; they differ in face, body shape, clothes, and even gender. Fig. 6 and Fig. 7 contain multiple source-target person pairs from three datasets.

TABLE IV: Ablation on loss functions. The first three metric columns are on ComplexMotion, the last three on iPER.

| Losses | SSIM↑ | PSNR↑ | LPIPS↓ | SSIM↑ | PSNR↑ | LPIPS↓ |
| $L_p$ | 0.838 | 25.10 | 0.058 | 0.820 | 24.11 | 0.081 |
| $L_{GW}$ | 0.883 | 27.15 | 0.040 | 0.856 | 25.86 | 0.068 |
| $L_p + L_{GW}$ | 0.896 | 27.52 | 0.032 | 0.868 | 26.72 | 0.049 |

TABLE V: Ablation on the episodic memory module. The first three metric columns are on ComplexMotion, the last three on iPER.

| Variant | SSIM↑ | PSNR↑ | LPIPS↓ | SSIM↑ | PSNR↑ | LPIPS↓ |
| w/o memory | 0.892 | 27.38 | 0.036 | 0.860 | 26.38 | 0.051 |
| with memory | 0.896 | 27.52 | 0.032 | 0.868 | 26.72 | 0.049 |

4.4 Study on Key Components of FakeVideo (RQ3)

In this subsection, we study the effects of the different components of our method. With this goal in mind, we tried (1) removing the dense skip connections from the pose-to-appearance generation GAN, (2) utilizing different kinds of loss functions in the generation network, (3) removing the memory module from our framework, (4) using different face enhancement strategies, and (5) removing the multiple local GANs that are responsible for local enhancement.

Removing dense skip connections from the pose-to-appearance GAN. We first investigate the effect of removing the dense skip connections in the pose-to-appearance GAN. From Table II, we observe that the performance degrades significantly upon their removal. This is consistent with our intuition that the dense skip connections integrate multi-level latent features and give access to lower-level pose details, contributing to better performance.

Utilizing different kinds of loss functions. Then, we examine the effects of different kinds of loss functions. In the pose-to-appearance generation GAN, we employ a perceptual loss $\mathcal{L}_p$ and a Gromov-Wasserstein loss $\mathcal{L}_{GW}$. To study the impacts of the two loss functions, we conduct comparative experiments using $\mathcal{L}_p$, $\mathcal{L}_{GW}$, and the combined loss $\mathcal{L}_p + \mathcal{L}_{GW}$, respectively. The empirical results are elaborated in Table IV. The combined loss (i.e., $\mathcal{L}_p + \mathcal{L}_{GW}$) achieves the best performance, while using $\mathcal{L}_{GW}$ alone yields better results than using $\mathcal{L}_p$ alone.
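The exact Gromov-Wasserstein formulation used in the paper is not reproduced here. As an intuition aid, the sketch below shows a common simplified surrogate: with the coupling fixed to the identity (the i-th pose feature paired with the i-th appearance feature), the Gromov-Wasserstein objective reduces to matching the two intra-domain pairwise-distance matrices.

```python
import torch


def gw_surrogate_loss(pose_feats: torch.Tensor, app_feats: torch.Tensor) -> torch.Tensor:
    """Compare the pairwise-distance structure of two feature sets.

    pose_feats, app_feats: (N, D_pose) and (N, D_app) batches whose i-th rows
    correspond to the same sample. With the coupling fixed to the identity,
    penalising the discrepancy between the two intra-domain distance matrices
    is a simplified stand-in for the full Gromov-Wasserstein objective.
    """
    d_pose = torch.cdist(pose_feats, pose_feats)   # (N, N) distances in pose space
    d_app = torch.cdist(app_feats, app_feats)      # (N, N) distances in appearance space
    return ((d_pose - d_app) ** 2).mean()
```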

Removing the memory module from our framework. In order to evaluate the effectiveness of our memory module, we further try removing the episodic memory module from our framework. As shown in Table V, with the memory module, our approach achieves 0.14 and 0.24 higher PSNR scores than its counterpart without the memory module on the two datasets. This evidence shows that the memory module plays an important role in boosting the generation quality of our method.

Using different face enhancement strategies. For self-supervised face enhancement, we consider two schemes to select similar face images for the generated face: facial similarity computed with VGG image features, and facial similarity computed with the proposed face vector field. The results are demonstrated in Table II. Empirically, the face vector field strategy achieves significantly better performance, which is in accordance with our intuition. The image features of faces mainly capture appearance information (e.g., colors and eye shapes) while ignoring face orientation, making it difficult to effectively select similar face images to enhance the generated face. In contrast, our face vector field strategy accurately represents fine-grained face orientation details, which is conducive to selecting similar face images that are more valuable for compensating the facial details of the generated face.

We also examine the influence of the number of face vectors, as shown in Table II. We observe that the quality of the generated image gradually increases with the number of face vectors. We also ablate the number of similar face images selected, as shown in Table II. We find that three images might provide sufficient facial information, resulting in informed self-supervised face enhancement.
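One plausible way to retrieve similar-orientation faces with the face vectors described above is a simple cosine-similarity search, sketched below. The flattened-vector representation and the choice of k = 3 follow the ablation; everything else (shapes, normalisation) is an illustrative assumption.

```python
import numpy as np


def select_similar_faces(query_vecs: np.ndarray,
                         candidate_vecs: np.ndarray,
                         k: int = 3) -> np.ndarray:
    """Return indices of the k candidate faces whose orientation vectors are
    closest (by cosine similarity) to those of the generated face.

    query_vecs: (V, 2) face vectors of the generated face, V vectors per face.
    candidate_vecs: (N, V, 2) face vectors of N real frames of the target person.
    """
    q = query_vecs.reshape(-1)
    q = q / (np.linalg.norm(q) + 1e-8)
    c = candidate_vecs.reshape(candidate_vecs.shape[0], -1)
    c = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                      # cosine similarity to every candidate face
    return np.argsort(-sims)[:k]      # indices of the k most similar faces
```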

[Figure 8]

Removing local GANs from the local enhancement module. Finally, we remove the multiple local GANs to examine their contributions. As shown in Table II, FID significantly increases from 48.03 to 61.49 upon the removal of the local GANs. This dramatic image quality degradation highlights the effectiveness of the local GANs in local refinement. In particular, the residual images of human body parts generated by the local GANs capture the differences in color and texture details between the generated body-part image and the ground truth, facilitating the generation of more lifelike local images.
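A minimal sketch of residual-based local refinement is given below: each local GAN is assumed to output a residual image for its body-part crop, which is added back onto the coarse frame. The cropping interface and tensor shapes are illustrative, not the authors' implementation.

```python
import torch


def refine_part(coarse_frame: torch.Tensor, box: tuple, local_gan) -> torch.Tensor:
    """Refine one body-part crop with its dedicated local GAN.

    `box` is (y0, y1, x0, x1) in pixel coordinates; `local_gan` is assumed to
    output a residual image of the same size as its input crop.
    """
    y0, y1, x0, x1 = box
    crop = coarse_frame[:, :, y0:y1, x0:x1]           # (B, C, h, w) part crop
    residual = local_gan(crop)                        # predicted colour/texture residual
    refined = coarse_frame.clone()
    refined[:, :, y0:y1, x0:x1] = crop + residual     # add the residual back in place
    return refined
```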

4.5 Motion Copy on Other Articulated Objects (RQ4)

In addition to performing motion copy on humans, we are curious whether our approach can be generalized to other articulated objects, including zebrafish and mice. To this end, we further conduct experiments on the Fish [56] and Mouse [57] datasets. Interestingly, our method can be adapted to copy the motions of fish and mice. Empirical results are demonstrated in Fig. 8. Take fish as an example: in the training stage, we first employ Lie-X [56] to detect the desired poses of the fish from the given videos. Then, we disentangle the frames of the video of the target fish into foreground and background using Mask-RCNN [47]. Thereafter, we feed the desired poses into our pose-to-appearance generation network, where the network architecture remains the same but the feature size is adapted to fit fish. We would like to point out that we do not enhance the details of the generated frames, since a zebrafish is small in body size and the pose-to-appearance generation seems to be sufficient to yield realistic fish frames. Finally, we couple the generated foreground with the background, obtaining the entire fish video. In the inference stage, the network is fed with the desired poses from another fish, and we can synthesize a lively video of the target fish in which it swims and acts like the other fish. Experiments on fish and mice show that our method is able to copy the motions of other articulated objects.

4.6 Discussion about the Computational Time Comparison

Motion transfer models can be classified into two categories: dedicated-purpose models and general-purpose models. Specifically, dedicated-purpose models excel at generating fake videos of a specific person and offer high video quality at the expense of longer training time. In contrast, general-purpose models can generate fake videos of any person, requiring less training time but yielding less satisfactory results than dedicated-purpose models. In this paper, we concentrate on dedicated-purpose models. We would like to emphasize that despite the longer training time required by our dedicated-purpose model, it offers a shorter inference time. Empirical results are presented in Table VI. Specifically, during the inference phase, EDN [6] achieves an average of 14.29 frames per second (FPS), FakeMotion [55] achieves an average FPS of 15, and our method achieves an average FPS of 25.25.

TABLE VI: Inference speed comparison.

| Methods | EDN | FakeMotion | FakeVideo (Ours) |
| FPS | 14.29 | 15 | 25.25 |
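Inference FPS figures such as those in Table VI can be measured as sketched below; the warm-up iterations and CUDA synchronisation are standard benchmarking practice rather than details taken from the paper.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, pose_frames, warmup: int = 10) -> float:
    """Average frames-per-second of `model` over a list of pose inputs."""
    for p in pose_frames[:warmup]:        # warm-up iterations are not timed
        model(p)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for p in pose_frames[warmup:]:
        model(p)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (len(pose_frames) - warmup) / (time.time() - start)
```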

4.7 User Study

We conduct a user study with a cohort of 25 volunteers to evaluate the quality of the generated results. Each participant is presented with six clusters of generated results and asked to designate the best result within each group. Ultimately, we collect a total of 25 responses; the results are shown in Fig. 9. We can see that the proposed FakeVideo obtains the highest rating and significantly outperforms the other methods (EDN [6], PoseWarp [13], FSV2V [7], C2F-FWN [10], and LWGAN [38]).

[Figure 9]

4.8 Failure Case Analysis

[Figure 10]

Interestingly, we also observed a few failure cases, which are shown in Fig. 10. The first failure case is shown in Fig. 10 (a): when the hands of the source person are undetectable, the generated hands are not realistic enough. The second failure case is shown in Fig. 10 (b): if there are artifacts in the input source image, such as elongated arms and lower legs, the result is a target image with missing forearms and misaligned lower legs. In subsequent work, we plan to implement the following measures to further improve the generation:

(1) We will develop a more accurate human pose estimation framework, which plays an important role in the task of human motion copy.

(2) Additionally, we will enhance the network structure. Specifically, in cases where a generated limb of the synthesized frame is missing, the network will strive to generate a limb that aligns with the target person, thereby ensuring a more coherent output.

5 Conclusion

In this work, we present FakeVideo, a novel approach for motion copy. The crucial ingredients are a pose-to-appearance generation network with Gromov-Wasserstein and perceptual losses, and a memory module that consistently learns from its past poor generations. We further introduce a self-supervised face enhancement module that resorts to face frames with similar orientations to polish the facial details of the generated face. Interestingly, our approach can be generalized to other articulated objects, including fish and mice. Extensive empirical results on five datasets, iPER, ComplexMotion, SoloDance, Fish, and Mouse, demonstrate the efficacy of the proposed method.

References

  • [1] V. Visch, E. Tan, and D. P. Saakes, "Viewer knowledge: Application of exposure-based layperson knowledge in genre-specific animation production," International Journal of Design, vol. 9, no. 1, pp. 83–89, 2015.
  • [2] T.-Y. Mou, "Creative story design method in animation production pipeline," in DS79: Proceedings of The Third International Conference on Design Creativity, Indian Institute of Science, Bangalore, 2015.
  • [3] J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, and M. Ivkovic, "Augmented reality technologies, systems and applications," Multimedia Tools and Applications, vol. 51, no. 1, pp. 341–377, 2011.
  • [4] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, "A survey on mobile augmented reality with 5G mobile edge computing: architectures, applications, and technical aspects," IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1160–1192, 2021.
  • [5] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1526–1535.
  • [6] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5933–5942.
  • [7] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, "Few-shot video-to-video synthesis," arXiv preprint arXiv:1910.12713, 2019.
  • [8] M. Ghafoor, K. Javed, and A. Mahmood, "Walk like me: Video to video action transfer," IEEE Transactions on Multimedia, p. 1, 2022.
  • [9] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao, "Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5904–5913.
  • [10] D. Wei, X. Xu, H. Shen, and K. Huang, "C2F-FWN: Coarse-to-fine flow warping network for spatial-temporal consistent motion transfer," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2852–2860.
  • [11] D. Joo, D. Kim, and J. Kim, "Generating a fusion image: One's identity and another's shape," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1635–1643.
  • [12] Z. Huang, X. Han, J. Xu, and T. Zhang, "Few-shot human motion transfer by personalized geometry and texture modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2297–2306.
  • [13] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, "Synthesizing images of humans in unseen poses," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8340–8348.
  • [14] Z. Liu, K. Lyu, S. Wu, H. Chen, Y. Hao, and S. Ji, "Aggregated multi-GANs for controlled 3D human motion prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2225–2232.
  • [15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
  • [16] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in International Conference on Machine Learning. PMLR, 2016, pp. 1747–1756.
  • [17] H. Zheng, J. Chen, H. Du, W. Zhu, S. Ji, and X. Zhang, "Grip-GAN: An attack-free defense through general robust inverse perturbation," IEEE Transactions on Dependable and Secure Computing, 2021.
  • [18] J. Zhang, K. Li, Y.-K. Lai, and J. Yang, "PISE: Person image synthesis and editing with decoupled GAN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7982–7990.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
  • [21] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
  • [22] L. Gao, D. Chen, Z. Zhao, J. Shao, and H. T. Shen, "Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis," Pattern Recognition, vol. 110, p. 107384, 2021.
  • [23] J. Cao, Y. Hu, B. Yu, R. He, and Z. Sun, "3D aided duet GANs for multi-view face image synthesis," IEEE Transactions on Information Forensics and Security, vol. 14, no. 8, pp. 2028–2042, 2019.
  • [24] Y. He, J. Zhang, H. Shan, and L. Wang, "Multi-task GANs for view-specific feature learning in gait recognition," IEEE Transactions on Information Forensics and Security, vol. 14, no. 1, pp. 102–113, 2018.
  • [25] J. Yang, D. Ruan, J. Huang, X. Kang, and Y.-Q. Shi, "An embedding cost learning framework using GAN," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 839–851, 2019.
  • [26] S. Shehnepoor, R. Togneri, W. Liu, and M. Bennamoun, "ScoreGAN: A fraud review detector based on regulated GAN with data augmentation," IEEE Transactions on Information Forensics and Security, vol. 17, pp. 280–291, 2021.
  • [27] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," arXiv preprint arXiv:1701.04862, 2017.
  • [28] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [29] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [30] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, "Unsupervised person image synthesis in arbitrary poses," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8620–8628.
  • [31] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, "Video-to-video synthesis," arXiv preprint arXiv:1808.06601, 2018.
  • [32] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised video retargeting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–135.
  • [33] P. Esser, E. Sutter, and B. Ommer, "A variational U-Net for conditional appearance and shape generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8857–8866.
  • [34] J. Ren, M. Chai, S. Tulyakov, C. Fang, X. Shen, and J. Yang, "Human motion transfer from poses in the wild," in European Conference on Computer Vision. Springer, 2020, pp. 262–279.
  • [35] C. Xu, Y. Fu, C. Wen, Y. Pan, Y.-G. Jiang, and X. Xue, "Pose-guided person image synthesis in the non-iconic views," IEEE Transactions on Image Processing, vol. 29, pp. 9060–9072, 2020.
  • [36] Z. Yang, W. Zhu, W. Wu, C. Qian, Q. Zhou, B. Zhou, and C. C. Loy, "TransMoMo: Invariance-driven unsupervised video motion retargeting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5306–5315.
  • [37] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin, "Soft-gated warping-GAN for pose-guided person image synthesis," Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [38] W. Liu, Z. Piao, Z. Tu, W. Luo, L. Ma, and S. Gao, "Liquid warping GAN with attention: A unified framework for human image synthesis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [39] A. Shysheya, E. Zakharov, K.-A. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov et al., "Textured neural avatars," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2387–2397.
  • [40] L. Liu, W. Xu, M. Habermann, M. Zollhöfer, F. Bernard, H. Kim, W. Wang, and C. Theobalt, "Neural human video rendering by learning dynamic textures and rendering-to-video translation," arXiv preprint arXiv:2001.04947, 2020.
  • [41] X. Han, X. Hu, W. Huang, and M. R. Scott, "ClothFlow: A flow-based model for clothed person generation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10471–10480.
  • [42] S. Liu, Y. Li, and G. Hua, "Human pose estimation in video via structured space learning and halfway temporal evaluation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2029–2038, 2018.
  • [43] M. Ghafoor and A. Mahmood, "Quantification of occlusion handling capability of 3D human pose estimation framework," IEEE Transactions on Multimedia, 2022.
  • [44] S. Aftab, S. F. Ali, A. Mahmood, and U. Suleman, "A boosting framework for human posture recognition using spatio-temporal features along with radon transform," Multimedia Tools and Applications, vol. 81, no. 29, pp. 42325–42351, 2022.
  • [45] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
  • [46] Z. Liu, H. Chen, R. Feng, S. Wu, S. Ji, B. Yang, and X. Wang, "Deep dual consecutive network for human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 525–534.
  • [47] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
  • [48] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Free-form image inpainting with gated convolution," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4471–4480.
  • [49] Y. Cheng, C. Xu, Z. Hai, and Y. Li, "DeepMnemonic: Password mnemonic generation via deep attentive encoder-decoder model," IEEE Transactions on Dependable and Secure Computing, 2020.
  • [50] Y. Wang, D. J. Tan, N. Navab, and F. Tombari, "SoftPool++: An encoder–decoder network for point cloud completion," International Journal of Computer Vision, vol. 130, no. 5, pp. 1145–1164, 2022.
  • [51] G. Peyré, M. Cuturi, and J. Solomon, "Gromov-Wasserstein averaging of kernel and distance matrices," in International Conference on Machine Learning. PMLR, 2016, pp. 2664–2672.
  • [52] S. Wu, Z. Liu, S. Lu, and L. Cheng, "Dual learning music composition and dance choreography," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3746–3754.
  • [53] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, and G. Wang, "Low dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss," 2017.
  • [54] C. de Masson D'Autume, S. Ruder, L. Kong, and D. Yogatama, "Episodic memory in lifelong language learning," Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [55] Z. Liu, S. Wu, C. Xu, X. Wang, L. Zhu, S. Wu, and F. Feng, "Copy motion from one to another: Fake motion video generation," arXiv preprint arXiv:2205.01373, 2022.
  • [56] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng, "Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups," International Journal of Computer Vision, vol. 123, no. 3, pp. 454–478, 2017.
  • [57] Z. Liu, S. Wu, S. Jin, Q. Liu, S. Ji, S. Lu, and L. Cheng, "Investigating pose representations and motion contexts modeling for 3D motion prediction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [58] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  • [59] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [60] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
  • [61] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," Advances in Neural Information Processing Systems, vol. 29, 2016.
  • [62] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [63] C.-H. Yao, C.-Y. Chang, and S.-Y. Chien, "Occlusion-aware video temporal consistency," in Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 777–785.
Sifan Wu is currently pursuing a Ph.D. degree at Jilin University. He received his B.E. and M.E. degrees from Hebei GEO University and Zhejiang Gongshang University, respectively. His research interests include computer vision, motion copy, and pose estimation.
Zhenguang Liu is currently a research professor at Zhejiang University. He was a research fellow at the National University of Singapore and A*STAR (Agency for Science, Technology and Research, Singapore). He received his Ph.D. and B.E. degrees from Zhejiang University and Shandong University, China, respectively. His research interests include multimedia data analysis and smart contract security. Various parts of his work have been published in first-tier venues including PAMI, ACM CCS, CVPR, ICCV, TKDE, TIP, WWW, TDSC, AAAI, ACM MM, INFOCOM, IJCAI, etc. Dr. Liu has served as a technical program committee member for top-tier conferences such as CVPR, ICCV, WWW, AAAI, IJCAI, and ACM MM, session chair of ICGIP, local chair of KSEM, and reviewer for IEEE PAMI, IEEE TVCG, IEEE TPDS, IEEE TIP, ACM TOMM, IEEE MM, etc.
Beibei Zhang received his B.Sc. (Hons) degree in Information Technology from The Hong Kong Polytechnic University in 2017 and his M.Eng. degree from the Department of Electrical & Computer Engineering, University of Toronto, in 2020. He is currently a Research Engineer at Zhejiang Lab, Hangzhou, China. His research interests include distributed systems, cloud computing, and peer-to-peer networks.
Zhongjie Ba received the Ph.D. degree in Computer Science and Engineering from the State University of New York at Buffalo, USA, in 2019. He is currently a Professor with the School of Cyber Science and Technology, College of Computer Science and Technology, Zhejiang University, China. He was a Postdoctoral Researcher in the School of Computer Science at McGill University, Canada. His current research interests include the security and privacy aspects of the Internet of Things, forensic analysis of multimedia content, and privacy-enhancing technologies in the context of collaborative deep learning. Results have been published in peer-reviewed top conferences and journals, including CCS, NDSS, INFOCOM, ICDCS, and IEEE Trans. Inf. Forensics Security. Currently, Zhongjie Ba serves as an Associate Editor of the IEEE Internet of Things Journal and on the technical program committees of several conferences in the field of Internet of Things and wireless communication.
Roger Zimmermann (M'93–SM'07) received the M.S. and Ph.D. degrees from the University of Southern California, Los Angeles, USA, in 1994 and 1998, respectively. He is currently an Associate Professor with the Department of Computer Science, National University of Singapore (NUS), Singapore, where he is also the Deputy Director of the Smart Systems Institute, and co-directed the Centre of Social Media Innovations for Communities. He has co-authored a book, seven patents, and over 200 conference publications, journal articles, and book chapters. His research interests include streaming media architectures, distributed systems, mobile and geo-referenced video management, collaborative environments, spatio-temporal information management, and mobile location-based services. He is a Senior Member of the IEEE and a Distinguished Member of the ACM.
Xiaosong Zhang received the B.S. degree in dynamics engineering from Shanghai Jiao Tong University, Shanghai, China, in 1990, and the M.S. and Ph.D. degrees in computer science from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2011. He is currently a Professor with the School of Computer Science and Engineering, UESTC. He has worked on numerous projects in both research and development roles, including device security, intrusion detection, malware analysis, software testing, and software verification. He has coauthored a number of research articles on computer security. His current research interests include software reliability, software vulnerability discovery, software test case generation, and reverse engineering.
Kui Ren (Fellow, IEEE; Fellow, ACM) received the Ph.D. degree from Worcester Polytechnic Institute. He is currently a professor and an associate dean of the College of Computer Science and Technology, Zhejiang University, where he also directs the Institute of Cyber Science and Technology. Before that, he was the SUNY Empire Innovation Professor at the State University of New York at Buffalo. He has authored or coauthored extensively in peer-reviewed journals and conferences. His research interests include data security, IoT security, AI security, and privacy. His h-index is 74, and his total publication citations exceed 32,000 according to Google Scholar. He was the recipient of the Guohua Distinguished Scholar Award from ZJU in 2020, the IEEE CISTC Technical Recognition Award in 2017, the SUNY Chancellor's Research Excellence Award in 2017, the Sigma Xi Research Excellence Award in 2012, the NSF CAREER Award in 2011, a Test-of-Time Paper Award from IEEE INFOCOM, and many best paper awards from the IEEE and ACM, including MobiSys'20, ICDCS'20, Globecom'19, ASIACCS'18, and ICDCS'17. He is a distinguished member of the ACM and a Clarivate highly-cited researcher. He is a frequent reviewer for funding agencies internationally and was on the editorial boards of many IEEE and ACM journals. He is the chair of the SIGSAC of ACM China.