Do as I Do: Pose Guided Human Motion Copy (2024)

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, and Kui Ren

Sifan Wu is with the School of Computer Science and Technology, Jilin University, Changchun 130015, China (e-mail: wusifan2021@gmail.com). Zhenguang Liu (corresponding author), Zhongjie Ba, and Kui Ren are professors at the School of Cyber Science and Technology, Zhejiang University, Hangzhou 310058, China (e-mail: liuzhenguang2008@gmail.com, {zhongjieba,kuiren}@zju.edu.cn). Beibei Zhang is with Zhejiang Lab, Hangzhou, Zhejiang Province 311121, China (e-mail: bzeecs@gmail.com). Roger Zimmermann is a professor at the School of Computing, National University of Singapore, 119613, Singapore (e-mail: rogerz@comp.nus.edu.sg). Xiaosong Zhang is with the Center for Cyber Security, College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China (e-mail: johnsonzxs@uestc.edu.cn).

Abstract

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers.

Motivated by this, we tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning, which helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches, gaining 7.2% and 12.4% improvements in PSNR and FID, respectively.

Index Terms:

Motion copy, deep fake, Gromov-Wasserstein, fake video.

1 Introduction

The seismic breakthrough of artificial intelligence has given rise to numerous intriguing and appealing video applications. A compelling application is to copy the motion from a source person to a target person, generating a fake video of the target person enacting the same motion as the source person. Motion copy empowers an untrained person to be depicted in videos dancing like a professional dancer, acting like a Kung Fu star, and playing basketball like an NBA player. Correspondingly, motion copy finds its applications in a wide spectrum of scenarios including animation production [1, 2], augmented reality [3, 4], and social media entertainment [5]. Interestingly, the source and target persons might be greatly different in body shape, appearance, and race.

Fundamentally, motion copy amounts to learning a mapping from the given video of a source person to the target video of a target person, as shown in Fig. 1. The task is inherently challenging due to the high dimensionality of the mapping and the subtle motion details to be generated. Technically, each frame of the target fake video comprises millions of pixels. Even a few wrong pixels are highly noticeable to human observers.

Generally, motion copy is carried out in two steps. In the first step, the pose or mesh sequence of the source person is extracted from the source video. In the second step, motion copy learns a generative model that maps the intermediate representation (pose or mesh sequence) to the appearance of the target person, synthesizing the fake video where the target person enacts the motion of the source person. One line of works extracts human poses as the intermediate representation, and is referred to as pose-guided methods [6, 7, 8]. Another line of works captures human body meshes as the intermediate representation, and is termed warping-guided methods [9, 10]. Recently, a few approaches advocate transferring motion directly in the image feature space [11] or introduce neural rendering techniques to reconstruct human templates from static images [12]. In this paper, we focus on pose-guided target video generation, in view of its efficiency and robustness to cloth deformation.

[Figure 1]

Upon investigating and experimenting with the released implementations of state-of-the-art methods [6, 13, 7, 9, 10], we empirically observe the following issues: (1) Current pose-to-appearance generation models primarily hinge on either an L1 or L2 loss to train a GAN that bridges the gap between a pose and its target appearance. Such GANs necessitate a large number of training samples. Nevertheless, we often have only one or a few videos of the target person for training. (2) Whereas existing methods achieve plausible results in broad strokes, the issues of distorted faces, hands, and feet are quite rampant. The high-fidelity textures of the face, hands, and feet, which either require sufficient details or undergo flexible movements, are usually missing. (3) Most existing methods generate each frame independently, ignoring the fact that adjacent frames are closely related to each other. This usually leads to temporal inconsistency in the generated video.

In this paper, we embrace three key designs to tackle the challenges. (1) We augment our pose-to-appearance GAN with a theoretically motivated Gromov-Wasserstein loss and a perceptual loss, which alleviates the problem of scarce training samples and attains realistic results. (2) We propose an episodic memory module in the pose-to-appearance generation so that the model continuously accumulates experience from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We instill spatial coherency and temporal consistency into our generated video by designing a spatial-temporal discriminator.

Interestingly, most existing methods typically focus on fake human motion generation. In this paper, we explore applying our approach to a range of objects including humans, fish, and mice. Extensive experiments are conducted on benchmark datasets including iPER, ComplexMotion, SoloDance, Fish, and Mouse datasets. Empirically, our approach outperforms state-of-the-art approaches by a large margin (7.2% and 12.4% gain in PSNR and FID) in fake video generation.

To summarize, the key contributions of this work are:

  • We investigate a novel framework that incorporates the Gromov-Wasserstein loss and perceptual loss for pose-to-appearance generation, which encodes pairwise distance constraints and attains realistic results.

  • In light of the divide-and-conquer strategy [14], we polish the local regions of key body parts including face, hands, and feet separately with dedicated local GANs. We empirically present a new vector field incorporating ears to characterize the face orientation, which serves to identify frames with similar face orientations to enhance the generated face.

  • Extensive experiments show that our approach achieves state-of-the-art performance. Besides, our approach can be generalized to other articulated objects, including fish and mice.

We would like to share that this paper is a continuation of our earlier work “Copy Motion From One to Another: Fake Motion Video Generation” published in IJCAI 2022 [55], which was accepted as a Long Presentation paper at an acceptance rate of 3.75% (the paper acceptance rate is 15%, and Long Presentation papers are those that rank in the top 25% among the accepted papers). This work is distinct from the conference version in four aspects. (1) Unlike our earlier work, which generates each frame independently in a sequence-to-frame framework, this work generates $k$ consecutive foreground frames simultaneously with a sequence-to-sequence framework, encoding the wealth of temporal context information. (2) In this paper, we propose a novel episodic memory component that stores the poor generations of the model and replays these samples occasionally to enforce the model to continuously learn from its defects. (3) To capture the orientation of the human face, in contrast to the mouth vector employed in our earlier work, we experimentally discover that the geometric information from the ear vectors on the face is more stable and significant. Inspired by this, we present a new vector field to characterize the face orientation. (4) This work consistently outperforms the earlier work on the iPER and ComplexMotion datasets and provides more insights and findings on human motion copy. Significantly, our earlier work focuses only on human motion. In this paper, we explore applying our approach to a range of objects including humans, fish, and mice.

The remainder of the paper is organized as follows. In Section 2, we give a brief introduction to the related work on image synthesis and human motion copy. Thereafter, we elaborate on the proposed method in Section 3. In Section 4, we present the experiments and performance analysis. Finally, we conclude the paper in Section 5.

2 Related Work

Before diving into the details of our approach, let us first review and categorize the related works on motion copy. We first recap the holistic view of image synthesis, which provides a broader range of research pertinent to human motion copy. We then present existing human motion copy approaches, which can be cast into three categories, namely pose-guided human motion copy, warping-guided human motion copy, and no-intermediary human motion copy.

2.1 Image Synthesis

Earlier research resorts to Variational Autoencoders [15] and Auto-Regressive models [16] for image synthesis. Recently, the proposal and application [17, 18] of Generative Adversarial Networks (GANs) [19] have led to great advancement in image generation. Technically, GANs utilize a generator-discriminator architecture, where the generator produces images and the discriminator distinguishes between real and fake images. The generator and discriminator are iteratively optimized in a two-player min-max game. Conditional GANs synthesize images under a given conditional input (e.g., class labels). Isola et al. [20] consider the conditional GAN as a general solution to image synthesis tasks such as image reconstruction, style transfer, and image coloring. Rather than generating a vanilla image, [21, 18] propose a two-stage GAN to produce a high-resolution image. Toward photo-realistic image generation, Gao et al. [22] propose a lightweight network structure that contains one generator and two discriminators to generate two images of different sizes in a feed-forward process. GANs have made remarkable progress in recent years on many tasks [23, 24, 25, 26]. However, it is well known that GANs are difficult to train and the training process is usually unstable. Toward an easy-to-train and stable GAN, Arjovsky et al. [27, 28] propose WGAN, which introduces a novel Wasserstein loss. The Wasserstein metric has a superior smoothing property compared to the KL divergence used in standard GANs, which can theoretically alleviate the gradient vanishing problem. A drawback of these methods lies in requiring a considerable amount of samples to train a model, which might limit their applications in human motion copy, where we may not have a large number of training samples available.

2.2 Human Motion Copy

Existing approaches for human motion copy can be roughly categorized into three groups, namely pose-guided, warping-guided, and no-intermediary methods.

Pose-Guided Human Motion Copy. [29] is the first seminal work on human motion copy, which proposes a two-stage coarse-to-fine generation. Since then, a great deal of research has been conducted on human motion copy. Pumarola et al. [30, 31, 32] employ generators and discriminators to reconstruct the target person image with arbitrary poses. Esser et al. [33, 13] propose a unique conditional U-Net, which regulates the output of a variational auto-encoder on appearance. However, these approaches rely heavily on large-scale training samples, which is difficult to fulfill in practical applications. Ren et al. [34, 35] achieve great image quality with posture augmentation and novel image refinement. Ghafoor et al. [8] propose a video-to-video action transfer framework, which consists of a cascaded sequence of action transfer blocks with a multi-resolution structure similarity loss. Yang et al. [36] perform human video motion transfer in an unsupervised manner, which utilizes the invariance of three orthogonal variation factors, including motion, structure, and view. Nonetheless, these methods fail to take into account the importance of maintaining facial details during the transfer of human motion. Although Chan et al. [6] introduce a face enhancement module, due to the overfitting problem of the GAN, it is not effective in generating satisfactory faces. In contrast, our body part enhancement polishes the generated face with a self-supervised training scheme and refines the key body portions using dedicated local GANs.

Warping-Guided Human Motion Copy. Dong et al. [37, 9, 38] disentangle the human image into action and appearance, and then perform motion imitation with a warping GAN that distorts the image according to reference poses. Similarly, Shysheya et al. [39, 40] introduce an attention mechanism between the pose skeleton and the image to generate UV coordinates, and then warp patch-level human texture maps to fit the UV coordinates. However, these methods are limited by the diversity of texture maps, resulting in blurs and artifacts in the generated video. Han et al. [41] focus on learning an appearance flow that warps the clothing of a target person to the corresponding area of the source person. Wei et al. [10] warp the motion of the target human image and then refine the details. Nevertheless, warping-based motion copy methods, by nature, have difficulties in coping with rapid human motion. Moreover, these methods disregard the temporal consistency across frames, resulting in discontinuous videos and visual artifacts.

No-Intermediary Human Motion Copy. There are also attempts that direct their efforts at motion copy without any intermediaries (i.e., poses or meshes). Joo et al. [11] employ two specific losses to constrain a GAN that generates a fusion image (one person's identity with another's shape). However, this work concentrates on upper-body motion copy (without legs and feet) and eye style transfer. In contrast, our model not only achieves whole-body human motion copy but also boldly attempts motion copy between animals. To the best of our knowledge, we are the first to replicate movements between articulated objects of the same species, including fish and mice.

[Figure 2]

3 Our Method

Problem Formulation. Broadly, given two videos, one video for the target person whose appearance we would like to synthesize and the other video for the source person whose actions we would like to copy[6], we are interested in generating a fake video of the target person performing the same actions as the source person.

Method Overview. An overview of our method FakeVideo is outlined in Fig. 2. Overall, FakeVideo consists of four key components: (1) The pose extraction module draws out the human poses from the video of the source person, where the poses serve as motion copy intermediaries. The foreground and background separation module segments the video of the target person into a foreground (i.e., human body) sequence and a background sequence. (2) The pose-to-appearance GAN generates an appearance sequence for the target person from the extracted pose sequence. The local enhancement module is further engaged to polish the local regions of key body parts (face, hands, and feet). (3) The episodic memory component stores the poor generations of the model and replays these samples occasionally to enforce the model to continuously learn from its own defects. (4) The foreground and background fusion module generates a fake video by fusing the polished foreground sequence and the background sequence. We would like to highlight that our generator has an edge in adopting Gromov-Wasserstein and perceptual losses while being equipped with a memory component. Meanwhile, our discriminator imposes spatial and temporal dual constraints, driving the generator toward better generations. In what follows, we elaborate on the four key components in detail.

3.1 Pose Extraction and Foreground-Background Separation

Pose Extraction. The goal of motion copy is to learn a mapping between a given video of the source person and the target video of the target person. Unfortunately, each frame of the two videos has millions of pixels, making it extremely difficult to acquire the mapping directly. Inspired by the rapid development of pose estimation techniques[42, 43, 44], we utilize pose skeleton sequence as the intermediary for motion copy. The pose sequence unambiguously indicates the motions and can be used to guide body appearance generation. To this end, we shift to learn a mapping from the poses to the body appearance sequence. Particularly, we adopt pre-trained pose detectors OpenPose[45] and DCPose [46] to extract poses from videos.
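
To make the pose intermediary concrete, below is a minimal sketch that rasterizes a set of detected 2D keypoints into a skeleton image suitable for conditioning the generator; the limb connectivity list and the (x, y, confidence) keypoint layout are illustrative assumptions rather than the exact OpenPose/DCPose output format.

```python
import numpy as np
import cv2

# Hypothetical limb connectivity (indices into a COCO-style keypoint layout);
# the actual skeleton definition depends on the pose detector's output format.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def render_pose_map(keypoints, height, width, conf_thresh=0.1):
    """Rasterize 2D keypoints (N x 3 array of x, y, confidence) into an RGB
    skeleton image that conditions the pose-to-appearance generator."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca < conf_thresh or cb < conf_thresh:
            continue  # skip limbs whose endpoints were not reliably detected
        cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (0, 255, 0), 3)
    for x, y, c in keypoints:
        if c >= conf_thresh:
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 0, 255), -1)
    return canvas
```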

[Figure 3]

Foreground and Background Separation. The pose skeleton clearly characterizes the motion; however, we believe it is too ambitious to synthesize a full frame (foreground and background) directly conditioned on a desired pose. Instead, an important step of our pipeline is to compute a mask matrix $M$, which is leveraged to explicitly disentangle each video frame into foreground and background. We devise a generator to concentrate on synthesizing only the foreground sequence from poses. This allows our model to avoid considering a large number of background pixels in the pose-to-appearance generation, resulting in a more realistic appearance of the human and faster convergence of the network. Specifically, we adopt the off-the-shelf Mask-RCNN [47] to obtain the mask matrix $M$. In addition, we employ image inpainting technology [48] to fill the removed foreground pixels in the background.
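
As a rough illustration of this pre-processing step, the following sketch extracts a binary person mask with torchvision's off-the-shelf Mask R-CNN; the weight identifier and score threshold are assumptions, not the paper's exact settings.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Load an off-the-shelf Mask R-CNN; weight names may differ across torchvision versions.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def person_mask(frame_rgb, score_thresh=0.7):
    """Return a binary H x W foreground mask M for the most confident 'person'
    detection (COCO label 1) in an RGB uint8 frame."""
    pred = model([to_tensor(frame_rgb)])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] > score_thresh)
    if not keep.any():
        return torch.zeros(frame_rgb.shape[:2], dtype=torch.bool)
    best = pred["scores"].masked_fill(~keep, -1).argmax()
    return pred["masks"][best, 0] > 0.5  # soft instance mask -> binary mask
```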

3.2 Pose-to-appearance Generation and Local Enhancement

Now, we consider how to generate an appealing body foreground sequence upon a given pose sequence. Technically, we design a pose-to-appearance generation GAN (appearance GAN), consisting of a generator that incorporates perceptual loss and Gromov-Wasserstein loss, and a discriminator that exerts spatio-temporal dual constraints.

Dense Skip Connections in Generator. The structure of the generator is illustrated in Fig. 3, where we employ a U-shaped architecture with multiple encoder-decoder layers. In a conventional U-Net, a decoder layer solely connects to its symmetric encoder layer [49, 50]. These relatively isolated relationships between encoder-decoder layers at different levels lead to insufficient spatial information modeling in the encoding and decoding process. Explicitly, during the encoding process of the conventional U-Net architecture, consecutive convolutions in the encoder inevitably drop some low-level detailed features. To tackle this challenge, we devise dense skip connections in the U-shaped architecture. Our motivation is to preserve rich features from multiple levels rather than using only one level of features in the foreground generation. Therefore, as shown in Fig. 3, instead of connecting a decoder at layer $i$ with only the symmetric encoder at layer $i$, we add extra skip connections from the encoders at layers $\{1,2,\cdots,i-1\}$ to the decoder at layer $i$. For example, decoder layer De-layer4 not only receives the feature information from the skip connection of encoder layer En-layer4 (as in the conventional U-Net), but also receives the feature information from encoder layers {En-layer1, En-layer2, En-layer3}. In this way, each decoder can integrate multi-level latent features and is able to access lower-level features.
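
A minimal PyTorch sketch of the dense-skip idea follows; the layer widths, downsampling scheme, and activation choices are illustrative and not the paper's exact generator configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSkipUNet(nn.Module):
    """Minimal U-shaped generator where decoder level i receives features from
    encoder levels 1..i (dense skip connections), not only its mirror level."""
    def __init__(self, in_ch=3, out_ch=3, widths=(64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = w
        self.decoders = nn.ModuleList()
        for d in reversed(range(len(widths) - 1)):
            skip_ch = sum(widths[:d + 1])   # dense skips from encoder levels 1..d+1
            up_ch = widths[d + 1]           # channels coming from the deeper level
            self.decoders.append(nn.Sequential(
                nn.Conv2d(up_ch + skip_ch, widths[d], 3, padding=1), nn.ReLU(inplace=True)))
        self.head = nn.Conv2d(widths[0], out_ch, 3, padding=1)

    def forward(self, x):
        feats = []
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)
        y = feats[-1]                       # bottleneck features
        for d, dec in zip(reversed(range(len(feats) - 1)), self.decoders):
            y = F.interpolate(y, size=feats[d].shape[-2:], mode="nearest")
            # resize the mirror and every shallower encoder map, then concatenate
            skips = [F.interpolate(f, size=y.shape[-2:], mode="nearest") for f in feats[:d + 1]]
            y = dec(torch.cat([y] + skips, dim=1))
        y = F.interpolate(y, scale_factor=2, mode="nearest")
        return torch.tanh(self.head(y))
```

The key point is that each decoder level concatenates resized feature maps from every encoder level at or below its own depth, rather than only from its mirror level.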

Gromov-Wasserstein Loss and Perceptual Loss to Facilitate Appearance Generation. In the training phase, we extract the pose sequence and foreground sequence from the video of the target person, and train our generator network to capture the mapping from the pose sequence to the corresponding foreground sequence of the target person. Existing methods typically address this pose-to-appearance problem with a conventional GAN and measure the discrepancy between the generated foreground frame and the ground truth frame via a pixel-wise L2 loss. Such approaches, by nature, require a large number of training samples to reach convergence. To alleviate this issue, we propose a Gromov-Wasserstein loss that preserves the distance structure of the feature space instead of the conventional pixel-wise L2 loss. Particularly, the Gromov-Wasserstein loss enforces that the generated fake frames have the same feature distance structure as their corresponding ground truth frames. Put differently, if two ground truth frames $F_i$ and $F_j$ are close to each other in the image feature space, the generated fake frames for them should also be close to each other. Conversely, if $F_i$ and $F_j$ are far apart in the image feature space, the generated fake frames for them should also be far apart. In this way, we are able to train the network in a pairwise manner, where the training samples are multiplied. Besides the Gromov-Wasserstein loss, we also add a perceptual loss that further forces the generated frame to be consistent with the ground truth frame.

Formally, given a pose sequence $\langle P_1, P_2, \cdots, P_m\rangle$, the pose-to-appearance generation network synthesizes a foreground sequence $\langle \overline{F}_1, \overline{F}_2, \cdots, \overline{F}_m\rangle$. Specifically, we denote the feature tensors of $\langle \overline{F}_1, \overline{F}_2, \cdots, \overline{F}_m\rangle$ as $\langle \hat{\mathcal{F}}_1, \hat{\mathcal{F}}_2, \cdots, \hat{\mathcal{F}}_m\rangle$, and the feature tensors of the corresponding ground truth sequence $\langle F_1, F_2, \cdots, F_m\rangle$ as $\langle \mathcal{F}_1, \mathcal{F}_2, \cdots, \mathcal{F}_m\rangle$. Mathematically,

$$\{\hat{\mathcal{F}}_k\}_{k=1}^{m}=\Phi(\{\overline{F}_k\}_{k=1}^{m}),\quad \{\mathcal{F}_k\}_{k=1}^{m}=\Phi(\{F_k\}_{k=1}^{m}) \qquad (1)$$

where $\Phi(\cdot)$ represents a pre-trained feature extraction backbone network. Heuristically, we show in Fig. 3 that optimizing the Gromov-Wasserstein loss amounts to aligning the two groups of feature tensors so that the generated fake images preserve the distance structure of their corresponding ground truth images. We could view $\{\hat{\mathcal{F}}_k\}_{k=1}^{m}$ and $\{\mathcal{F}_k\}_{k=1}^{m}$ as discrete empirical distributions $\mu$ and $\tau$, which are given by:

$$\mu=\sum_{k=1}^{m}\frac{1}{m}\delta_{\hat{\mathcal{F}}_k},\quad \tau=\sum_{k=1}^{m}\frac{1}{m}\delta_{\mathcal{F}_k} \qquad (2)$$

where $\delta_{(\cdot)}$ represents the Dirac delta distribution. Then, the Gromov-Wasserstein loss for our model can be formulated as:

$$\mathcal{L}_{GW(\mu,\tau)}=\min_{\pi\in\Pi}\sum_{i,j,k,l}\left|\left\|\hat{\mathcal{F}}_i-\hat{\mathcal{F}}_k\right\|_1-\left\|\mathcal{F}_j-\mathcal{F}_l\right\|_1\right|^{2}\pi_{ij}\pi_{kl} \qquad (3)$$

where $\Pi$ denotes the set of joint distributions (couplings) with margins $\mu$ and $\tau$. The optimal transport matrix $\pi$ can be calculated by minimizing the squared discrepancy between the intra-space L1 costs.

Inspired by [51, 52], an entropy regularization term is introduced to ensure tractability and reversible backpropagation in the optimal transport loss optimization. In addition, we utilize the Sinkhorn algorithm and the projected gradient descent method [51] to solve the entropy-regularized Gromov-Wasserstein loss. Technically, the process of optimizing the Gromov-Wasserstein loss is outlined in Algorithm 1.

Algorithm 1: Optimizing the entropy-regularized Gromov-Wasserstein loss

1: Input: (i) generated feature tensors $\{\hat{\mathcal{F}}_k\}_{k=1}^{m}=\Phi(\{\overline{F}_k\}_{k=1}^{m})$ and (ii) ground truth feature tensors $\{\mathcal{F}_k\}_{k=1}^{m}=\Phi(\{F_k\}_{k=1}^{m})$
2: Output: Gromov-Wasserstein distance $GW_{\lambda}$
3: Hyperparameters: $\lambda>0$, projection iterations P, Sinkhorn iterations S
4: Initialize: $\pi_{kl}^{(0)}=\frac{1}{n},\ \forall k,l$; $m=j-i$
5: Cost matrix for generated feature tensors: $D_{ij}=\mathcal{L}_{1}(\hat{\mathcal{F}}_i,\hat{\mathcal{F}}_j)$
6: Cost matrix for ground truth feature tensors: $E_{ij}=\mathcal{L}_{1}(\mathcal{F}_i,\mathcal{F}_j)$
7: for t = 1:P do
8:   $C=\frac{1}{m}E^{2}\mathbb{1}_{m}\mathbb{1}_{m}^{T}+\frac{1}{m}\mathbb{1}_{m}\mathbb{1}_{m}^{T}D^{2}-2E\pi^{(t-1)}D^{T}$
9:   $K=e^{-C/\lambda}$
10:  $b^{(0)}=\mathbb{1}_{m}$
11:  for l = 1:S do
12:    $a^{(l)}=\mathbb{1}_{m}\oslash Kb^{(l-1)}$
13:    $b^{(l)}=\mathbb{1}_{m}\oslash K^{T}a^{(l)}$  # $\oslash$ denotes component-wise division
14:  end for
15:  $\pi^{(t)}=\mathrm{diag}(a^{(S)})\,K\,\mathrm{diag}(b^{(S)})$
16: end for
17: $GW_{\lambda}=\sum_{i,j,k,l}\left\|E_{ik}-D_{jl}\right\|^{2}\pi_{ij}^{(P)}\pi_{kl}^{(P)}$
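
For concreteness, here is a compact PyTorch sketch of the entropy-regularized Gromov-Wasserstein loss computed with Sinkhorn scaling over two sets of m flattened frame features; the hyperparameter values and the use of torch.cdist for the L1 cost matrices are assumptions for illustration.

```python
import torch

def gromov_wasserstein_loss(gen_feats, gt_feats, lam=0.1, proj_iters=5, sinkhorn_iters=20):
    """Entropy-regularized Gromov-Wasserstein distance between two sets of m
    feature vectors (m x d matrices), following a projected Sinkhorn scheme."""
    m = gen_feats.shape[0]
    ones = torch.ones(m, device=gen_feats.device)
    # Intra-space L1 cost matrices (Algorithm 1, lines 5-6).
    D = torch.cdist(gen_feats, gen_feats, p=1)
    E = torch.cdist(gt_feats, gt_feats, p=1)
    pi = torch.full((m, m), 1.0 / (m * m), device=gen_feats.device)
    for _ in range(proj_iters):
        # Gradient-like cost of the quadratic GW objective at the current coupling.
        C = (E ** 2 @ torch.outer(ones, ones) + torch.outer(ones, ones) @ D ** 2) / m \
            - 2.0 * E @ pi @ D.T
        K = torch.exp(-C / lam)
        a, b = ones.clone(), ones.clone()
        for _ in range(sinkhorn_iters):        # Sinkhorn scaling (lines 11-14)
            a = ones / (K @ b)
            b = ones / (K.T @ a)
        pi = torch.diag(a) @ K @ torch.diag(b)
    # Final GW objective: sum_{i,j,k,l} |E_ik - D_jl|^2 pi_ij pi_kl.
    # m is the number of frames in a short clip, so the m^4 sum stays affordable.
    gw = ((E.unsqueeze(1).unsqueeze(3) - D.unsqueeze(0).unsqueeze(2)) ** 2
          * pi.unsqueeze(2).unsqueeze(3) * pi.unsqueeze(0).unsqueeze(1)).sum()
    return gw
```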

Perceptual Loss.  While the Gromov-Wasserstein loss facilitates the appearance generation in the presence of sparse training samples, another loss is introduced into the network to better maintain image reconstruction details. An intuitive approach is to utilize the mean squared error (MSE) loss to minimize the pixel-wise loss between the generated human foreground $\overline{F}$ and the ground truth $F$:

$$\mathcal{L}_{MSE}=\left\|F-\overline{F}\right\|^{2}_{2}, \qquad (4)$$

where $\left\|\cdot\right\|_{2}$ represents the L2 norm. Nevertheless, the MSE loss may produce blurry and distorted images or lead to ill-posed details [53]. Given this context, we adopt a perceptual reconstruction loss that constrains the generated $\overline{F}$ to approach the ground truth in the feature space:

$$\mathcal{L}_{p}=\left\|\Psi(F)-\Psi(\overline{F})\right\|^{2}_{2}, \qquad (5)$$

where $\Psi(\cdot)$ represents a feature extraction network. The pixel-wise loss concentrates too much on the brightness of each pixel, whereas the feature-level loss places more emphasis on spatial consistency. Collectively, the Gromov-Wasserstein loss and the perceptual loss together facilitate appearance generation.
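
As an illustration, a frozen VGG-19 feature extractor can instantiate $\Psi(\cdot)$ as in Eq. (5); the backbone choice and layer cut-off below are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Feature-space reconstruction loss: compares generated and ground-truth
    foregrounds through a frozen VGG-19 backbone (backbone and layer cut-off
    are illustrative assumptions)."""
    def __init__(self, layer_index=16):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="DEFAULT").features[:layer_index]
        for p in vgg.parameters():
            p.requires_grad_(False)  # Psi(.) is a fixed, pre-trained extractor
        self.backbone = vgg.eval()

    def forward(self, fake, real):
        # Eq. (5): || Psi(F) - Psi(F_bar) ||_2^2 (averaged over elements here)
        return torch.mean((self.backbone(fake) - self.backbone(real)) ** 2)
```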

Discriminator in Pose-to-Appearance Generation.  (1) Previous approaches for motion copy typically employ a spatial discriminator that concentrates on the quality of each frame and fails to explicitly consider video continuity. (2) When we watch videos, we tend to care about both the quality of individual frames and the continuity across frames. We believe it is crucial to jointly take into account spatial consistency and temporal continuity. Based on the two observations above, we present a spatial-temporal dual constraint, consisting of a quality discriminator $D_q$ and a temporal discriminator $D_t$. Specifically, (1) the quality discriminator $D_q$ enforces the forged foreground image to approach the ground truth; (2) the temporal discriminator $D_t$ captures the temporal information across frames using a set of parallel dilated convolutions. $D_q$ takes $(P_i, F_i)$ or $(P_i, \overline{F}_i)$ as input, while $D_t$ absorbs $(P_{t-1}^{t+1}, F_{t-1}^{t+1})$ or $(P_{t-1}^{t+1}, \overline{F}_{t-1}^{t+1})$. Note that $P_i$ stands for the pose of the $i^{th}$ frame and $\overline{F}_i$ denotes the generated foreground for $P_i$. $P_{t-1}^{t+1}$ and $\overline{F}_{t-1}^{t+1}$ represent $\langle P_{t-1}, P_t, P_{t+1}\rangle$ and $\langle \overline{F}_{t-1}, \overline{F}_t, \overline{F}_{t+1}\rangle$, respectively. Both discriminators are trained to output binary labels, real or fake. Overall, the generator strives to create more lifelike videos to fool the dual discriminator, while the discriminator tries its best to distinguish between the generated video and the ground truth. Model performance is iteratively optimized in a two-player min-max game fashion.
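
The following sketch shows one plausible shape for the temporal discriminator $D_t$ with parallel dilated convolutions over a 3-frame window; the channel counts, dilation rates, and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Sketch of a temporal discriminator D_t: it stacks a 3-frame pose/foreground
    window along channels and applies parallel dilated convolutions to judge whether
    the short clip is real or generated (channel sizes and dilations are assumptions)."""
    def __init__(self, in_ch=(3 + 3) * 3, base=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, base, 3, stride=2, padding=d, dilation=d),
                nn.LeakyReLU(0.2, inplace=True))
            for d in dilations
        ])
        self.head = nn.Sequential(
            nn.Conv2d(base * len(dilations), base, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, 1, 3, padding=1))  # patch-level real/fake logits

    def forward(self, poses, frames):
        # poses, frames: (B, 3, 3, H, W) -> flatten the 3-frame window into channels
        x = torch.cat([poses.flatten(1, 2), frames.flatten(1, 2)], dim=1)
        feats = [branch(x) for branch in self.branches]
        return self.head(torch.cat(feats, dim=1))
```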

The image quality often appears imperfect when the fine-grained local details are missing. We scrutinized and implemented state-of-the-art methods following their released code and parameter settings [6, 13, 7, 9, 10]. A significant insight we gain from the experiments is that current methods still have difficulties in generating detailed face, natural hands, and clear feet. After obtaining the initial body appearance using the proposed pose-to-appearance GAN network, we further employ a self-supervised face enhancement component and multiple local GANs to polish the details of local parts.

[Figure 4]

Self Supervised Face Enhancement with Vector Field. Intuitively, face images of the same person with similar face orientations should look similar to each other. Therefore, we search the given videos of the target person to identify face images whose face orientations are similar to that of the synthesized image. In particular, we choose multiple images with the closest face orientations rather than using only the single closest one, making the enhancement more robust to noise.

For the measurement of face orientation similarity, an intuitive approach is to compute the similarity between facial features. However, facial features usually convey too much information irrelevant to face orientation, e.g., color and eye shape. To tackle the problem, a viable method, as shown in Fig. 4, is to represent the face orientation with a face vector field. As shown in Fig. 4, we employ six vectors: $v_1$: right eye $\rightarrow$ left eye, $v_2$: left eye $\rightarrow$ nose, $v_3$: right eye $\rightarrow$ nose, $v_4$: right ear $\rightarrow$ left ear, $v_5$: nose $\rightarrow$ right ear, $v_6$: nose $\rightarrow$ left ear. Given two face orientations $\{v_i\}_{i=1}^{6}$ and $\{\hat{v}_i\}_{i=1}^{6}$, their similarity can be conveniently computed as:

$$\mathcal{S}=\frac{1}{\sum_{i=1}^{6}\|\hat{v}_i-v_i\|_{2}} \qquad (6)$$

Subsequently, we choose the top $m$ real facial images $\mathbf{f}=\{f_1, f_2, \dots, f_m\}$ with the largest similarities $\mathcal{S}_m$ as auxiliary faces. Finally, the generated face $f$ is enhanced into:

$$f^{\prime}=\alpha\sum_{i=1}^{m}\left(\frac{S_i}{\sum_{j=1}^{m}S_j}\times f_i\right)+\beta f \qquad (7)$$

where $\frac{S_i}{\sum_{j=1}^{m}S_j}$ measures the weight of the $i^{th}$ chosen face $f_i$, and $\alpha$ and $\beta$ are hyperparameters. The process is also depicted in Fig. 5.
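
A small numpy sketch of the vector-field similarity (Eq. 6) and the weighted face enhancement (Eq. 7) is given below; the keypoint ordering and the values of $\alpha$, $\beta$, and the numerical epsilon are assumptions.

```python
import numpy as np

# Index pairs defining the six orientation vectors of Eq. (6); the keypoint order
# (nose, left/right eye, left/right ear) is an assumption about the detector output.
NOSE, L_EYE, R_EYE, L_EAR, R_EAR = 0, 1, 2, 3, 4
VECTOR_PAIRS = [(R_EYE, L_EYE), (L_EYE, NOSE), (R_EYE, NOSE),
                (R_EAR, L_EAR), (NOSE, R_EAR), (NOSE, L_EAR)]

def face_vectors(kpts):
    """kpts: (5, 2) array of facial keypoints -> (6, 2) array of orientation vectors."""
    return np.stack([kpts[b] - kpts[a] for a, b in VECTOR_PAIRS])

def orientation_similarity(kpts_a, kpts_b, eps=1e-6):
    """Eq. (6): reciprocal of the summed L2 distance between the two vector fields."""
    diff = face_vectors(kpts_a) - face_vectors(kpts_b)
    return 1.0 / (np.linalg.norm(diff, axis=1).sum() + eps)

def enhance_face(gen_face, real_faces, sims, alpha=0.5, beta=0.5):
    """Eq. (7): blend the m most similar real faces (weighted by similarity)
    with the generated face; alpha and beta are illustrative hyperparameters."""
    weights = np.asarray(sims) / np.sum(sims)
    blended = sum(w * face for w, face in zip(weights, real_faces))
    return alpha * blended + beta * gen_face
```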

[Figure 5]

Multi-Local GANs. After enhancing the face, we further refine the face and limbs using multiple local GANs. In light of the divide-and-conquer strategy [14], we design multi-local GANs to refine the key parts separately. Concretely, we clip the five key parts $\overline{F}^{i}$ (face, two hands, and two feet) from the generated foreground image $\overline{F}$ and feed them into the corresponding dedicated local GANs. Each local GAN outputs a residual image $\hat{F}_r^{i}$, which captures the difference between the generated body part and the ground truth (in terms of color and texture). These residual images are added to $\overline{F}$ (the original foreground generation) to produce the final foreground:

$$\widetilde{F}^{i}=\hat{F}_r^{i}+\overline{F}^{i} \qquad (8)$$

3.3 Episodic Memory for Experience Replay

For the pose-to-appearance generation, we adopt the theoretically grounded Gromov-Wasserstein loss to mitigate the issue of insufficient training samples. Furthermore, inspired by lifelong learning, we introduce an episodic memory component for appearance generation, which propels continuous learning and the accumulation of past knowledge over a lifetime. More specifically, we store previous poor generations in the episodic memory and replay them periodically during training. This enforces the network to consistently learn from its own mistakes and accumulate experience. Interestingly, the mechanism is similar to the human brain, which occasionally recaps significant moments recorded in memory. The entire procedure of memory replay is formulated in Algorithm 2. We may describe the high-level idea as follows:

Algorithm 2: Training with episodic memory replay, and inference

1: Training
2: Input: training samples $\langle P_t, F_t\rangle_{t=1}^{T}$, replay time interval K
3: # $P_i$ stands for the desired pose of the target person in the $i^{th}$ frame, and $F_i$ represents the corresponding appearance
4: Output: generation model G
5: for epoch = 1:N do
6:   if epoch mod K = 0 then
7:     Sample m examples from M
8:     # M represents the memory
9:     Calculate the Gromov-Wasserstein loss and perceptual loss, and then perform backpropagation to update the parameters of G
10:    # Experience Replay
11:  end if
12:  for t = 1:T do
13:    Retrieve training samples $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$
14:    Calculate the Gromov-Wasserstein loss and perceptual loss, and then perform backpropagation to update the parameters of G
15:    if store memory then
16:      Write $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$ to memory M
17:    end if
18:    if perceptual_loss > loss_threshold then
19:      Write the poor generation examples $\langle P_{t-1}^{t+1}, F_{t-1}^{t+1}\rangle$ into memory M
20:    end if
21:  end for
22: end for
23: Return G
24: Inference
25: Input: the poses $P_{t=1}^{T}$ of the source person $\mathbb{S}$, the generation model G
26: Output: the foreground $\overline{F}_{t=1}^{T}$
27: # the generated foreground (appearance) $\overline{F}$ for the target person $\mathbb{T}$
28: for t in range(1:T:3) do
29:   $\overline{F}_{t}^{t+2}$ = G($P_{t}^{t+2}$)
30: end for
31: return $\overline{F}_{t=1}^{T}$

In the first epoch, we utilize all training samples to train the pose-to-appearance generation network (Algorithm 2, lines 12-14). We then select all the poorly generated samples (those whose perceptual loss exceeds a threshold) together with a few other randomly selected samples, and put them into the episodic memory (Algorithm 2, lines 15-19). In the following epochs, we continue training on all the training samples. During this subsequent training, every K epochs we randomly select several samples from the episodic memory and replay (retrain on) them to update the parameters of the pose-to-appearance generation network (Algorithm 2, lines 6-11).

Memory replay would keep the model from catastrophic forgetting, continuously improving the generated frames by learning from past poor generations. However, overfitting problems may arise if the training samples in the memory are revisited too frequently. Following [54], our memory replay is designed to be executed only occasionally.
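
A minimal sketch of such a memory-and-replay mechanism is shown below; the class names, capacity, thresholds, and eviction policy are illustrative choices rather than the paper's implementation.

```python
import random
import torch

class EpisodicMemory:
    """Store pose/frame windows whose perceptual loss was high, then replay a
    small batch of them occasionally (capacity, batch size and the random
    eviction policy are illustrative choices)."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.buffer = []

    def maybe_store(self, poses, frames, perceptual_loss, loss_threshold=0.2):
        if perceptual_loss.item() > loss_threshold:     # a "poor generation"
            if len(self.buffer) >= self.capacity:
                self.buffer.pop(random.randrange(len(self.buffer)))
            self.buffer.append((poses.detach().cpu(), frames.detach().cpu()))

    def replay(self, generator, loss_fn, optimizer, batch_size=4, device="cuda"):
        """Retrain the generator on a few remembered poor generations."""
        if not self.buffer:
            return
        for poses, frames in random.sample(self.buffer, min(batch_size, len(self.buffer))):
            poses, frames = poses.to(device), frames.to(device)
            loss = loss_fn(generator(poses), frames)    # GW + perceptual losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

In training, maybe_store would be invoked per mini-batch while replay runs only once every K epochs, mirroring lines 15-19 and 6-11 of Algorithm 2.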

3.4 Foreground and Background Fusion

Up to this point, we have obtained the polished foreground $\widetilde{F}$. In the pre-processing phase, we have computed the mask matrix $M$ of the foreground in the image and have refilled the removed foreground pixels in the background $B$ following [48]. We utilize a linear sum to couple the foreground $\widetilde{F}$ and the background $B$:

$$\tilde{I}=M\odot\widetilde{F}+(1-M)\odot B \qquad (9)$$
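
Assuming the mask, foreground, and background are spatially aligned tensors, the fusion of Eq. (9) reduces to a few lines:

```python
import torch

def fuse_foreground_background(mask, foreground, background):
    """Eq. (9): blend the polished foreground into the inpainted background.
    mask: (1, H, W) binary/soft foreground mask M; foreground, background: (3, H, W)."""
    return mask * foreground + (1.0 - mask) * background
```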

3.5 Loss Functions

For the pose-to-appearance generator, we utilize the Gromov-Wasserstein and perceptual losses. Now, we zoom in on the loss functions of the discriminator. We introduce a standard adversarial loss, where the quality discriminator $D_q$ attempts to discern the real and generated frames:

$$\mathcal{L}_{q}=\mathbb{E}_{(P,F)}[\log D_{q}(P,F)]+\mathbb{E}_{P}[\log(1-D_{q}(P,\overline{F}))] \qquad (10)$$

We additionally propose a temporal consistency loss to ensure the temporal smoothness of the generated video:

$$\mathcal{L}_{t}=\mathbb{E}_{(P,F)}[\log D_{t}(P_{t-1}^{t+1},F_{t-1}^{t+1})]+\mathbb{E}_{P}[\log(1-D_{t}(P_{t-1}^{t+1},\overline{F}_{t-1}^{t+1}))] \qquad (11)$$

where $D_t$ is a temporal discriminator which tries to distinguish the real frame sequence $F_{t-1}^{t+1}$ from the fake sequence $\overline{F}_{t-1}^{t+1}$.
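The two objectives in Eqs. (10) and (11) can be written compactly with binary cross-entropy, as is conventional for this type of adversarial loss. The sketch below is illustrative: `D_q` and `D_t` are placeholders for the quality and temporal discriminators, and they are assumed to output logits rather than probabilities.

```python
import torch
import torch.nn.functional as F


def quality_d_loss(D_q, pose, real_frame, fake_frame):
    """Eq. (10): D_q separates real from generated frames, conditioned on the pose."""
    real_logit = D_q(pose, real_frame)
    fake_logit = D_q(pose, fake_frame.detach())
    real_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    return real_loss + fake_loss


def temporal_d_loss(D_t, pose_seq, real_seq, fake_seq):
    """Eq. (11): D_t judges short sequences (frames t-1..t+1) for temporal consistency."""
    real_logit = D_t(pose_seq, real_seq)
    fake_logit = D_t(pose_seq, fake_seq.detach())
    real_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    return real_loss + fake_loss
```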

[Figure 6]
[Figure 7]

4 Experiments

In this section, we first present the experimental settings and the details of the five benchmark datasets: iPER [9], ComplexMotion [55], SoloDance [10], Fish [56], and Mouse [57]. Then, we introduce the evaluation metrics for motion copy and compare our method with state-of-the-art approaches. Further, we investigate the effects of the different components of our framework. Finally, we adapt our method to other articulated objects, including fish and mice. Briefly, we seek to answer the following research questions.

  • RQ1: How is the proposed method compared to state-of-the-art methods on human motion copy?

  • RQ2: Is our method able to synthesize motion videos with attractive details for a target person?

  • RQ3: How much do different components of our method contribute to the performance?

  • RQ4: How well does the proposed method generalize to animals such as fish and mice?

Next, we introduce the experimental settings and empirically investigate the research questions one by one.

4.1 Experimental Settings

4.1.1 Datasets

Experiments are conducted on five benchmark datasets, iPER [9], ComplexMotion [55], SoloDance [10], Fish dataset [56], and Mouse dataset [57].

iPER. For human motion copy, we experiment on the iPER [9] dataset, which contains 30 persons with different shapes, heights, and genders. A person may wear different outfits, and there are 103 outfits in total. The dataset contains 241,564 frames from 206 videos. The videos cover different actions, including arm exercise, stretching exercise, standing and reaching, leaping, swimming, taichi, chest mobility exercise, leg stretching, squat, and leg-raising.

ComplexMotion. We also conduct experiments on the ComplexMotion [55] dataset, which contains rapid and complex motions of more than 50 persons. The videos are collected from various video platforms such as Tiktok (https://www.tiktok.com) and Youtube (https://www.youtube.com). In particular, ComplexMotion consists of 68,320 frames from 122 videos. Within the videos, persons wear various clothes and perform complex movements such as street dance, sports, and kung fu.

SoloDance. We further conduct experiments on the SoloDance [10] dataset, which contains 179 dance videos with 53,700 frames. Specifically, 143 human subjects were captured, each wearing different clothes and performing complex dances (e.g., modern and street dances) in various scenes.

Fish dataset. For motion copy from one fish to another, we utilize the Fish dataset [56], which contains 14 fish videos of 6 different fish. Each video consists of 2,250 to 24,000 frames.

Mouse dataset. For mouse motion copy, we use the Mouse dataset [57], which includes 12 mouse videos of 4 mice. The mouse depth images were captured at 25 FPS with a top-view Primesense Carmine camera. The 3D poses of the mouse are extracted from the depth images using the annotation tool of [57]. Then, we project the 3D poses onto the 2D plane to obtain the 2D poses of the mice. The number of frames in each video varies from 500 to 30,000.
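The paper does not specify the camera model used for this projection, so the sketch below assumes a simple orthographic drop of the depth axis; a calibrated perspective projection would replace it if camera intrinsics are available.

```python
import numpy as np

def project_to_2d(joints_3d: np.ndarray) -> np.ndarray:
    """Orthographic projection: drop the depth axis of (J, 3) joints -> (J, 2).

    This is an illustrative assumption; with known intrinsics, a perspective
    projection (x * f / z + cx, y * f / z + cy) would be used instead.
    """
    return joints_3d[:, :2]
```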

4.1.2 Implementation details

We utilize PyTorch 1.4.0 to implement our proposed framework. We train the FakeVideo framework independently on iPER and ComplexMotion. During training, all frames are resized to 512 × 512. We utilize OpenPose to detect 18 human joints in each frame of the dataset. For the Fish and Mouse datasets, we would like to point out that we do not further enhance the local details of the generated fish and mice, since they are small in body size and the pose-to-appearance generation seems to be sufficient to yield realistic fish and mice. We employ Mask-RCNN to disentangle the foreground (body) sequence and the background sequence from a video. We utilize the pre-trained VGGNet [58] as our frame feature extractor, which consists of 16 convolutional layers and 3 fully connected layers. The output of the 16th convolutional layer is the extracted feature, which is used for the perceptual loss. We train our model for 120 epochs on a server with NVIDIA GeForce RTX 2080 Ti GPUs.
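As a concrete reference for the feature extractor described above, the sketch below taps deep features from torchvision's pre-trained VGG-19 (which has 16 convolutional and 3 fully connected layers) and compares them with an L1 distance. The exact cut-off index and the L1 formulation are assumptions rather than the authors' code, and the `weights=` argument requires a recent torchvision (older versions use `pretrained=True`).

```python
import torch
import torch.nn.functional as F
import torchvision

# Keep everything up to the last conv/ReLU block of VGG-19; the slice index is
# an assumption and may differ from the exact layer used in the paper.
_vgg_features = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)


def perceptual_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between deep VGG features of generated and ground-truth frames."""
    return F.l1_loss(_vgg_features(generated), _vgg_features(target))
```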

TABLE I: Comparison with state-of-the-art methods on ComplexMotion and iPER. SSIM↑, PSNR↑, and LPIPS↓ evaluate Image Reconstruction; FID↓, IS↑, and TCM↑ evaluate Motion Imitation.

ComplexMotion:
| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | TCM↑ |
| EDN [6] | 0.823 | 24.36 | 0.061 | 64.12 | 3.411 | 0.534 |
| FSV2V [7] | 0.748 | 22.51 | 0.132 | 99.11 | 3.164 | 0.575 |
| PoseWarp [13] | 0.711 | 21.42 | 0.149 | 78.21 | 3.109 | 0.334 |
| LWGAN [9] | 0.789 | 24.27 | 0.081 | 85.30 | 3.398 | 0.683 |
| C2F-FWN [10] | 0.878 | 25.68 | 0.048 | 53.19 | 3.408 | 0.689 |
| FakeMotion [55] | 0.883 | 27.15 | 0.040 | 48.03 | 3.543 | 0.773 |
| FakeVideo (Ours) | 0.896 | 27.52 | 0.032 | 46.62 | 3.728 | 0.813 |

iPER:
| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | TCM↑ |
| EDN [6] | 0.852 | 24.48 | 0.086 | 57.52 | 3.305 | 0.591 |
| FSV2V [7] | 0.824 | 21.18 | 0.108 | 107.29 | 3.136 | 0.754 |
| PoseWarp [13] | 0.792 | 22.16 | 0.119 | 115.23 | 3.095 | 0.601 |
| LWGAN [9] | 0.843 | 22.32 | 0.091 | 76.38 | 3.258 | 0.729 |
| C2F-FWN [10] | 0.847 | 24.32 | 0.074 | 60.12 | 3.412 | 0.769 |
| FakeMotion [55] | 0.856 | 25.86 | 0.068 | 56.27 | 3.461 | 0.799 |
| FakeVideo (Ours) | 0.868 | 26.72 | 0.049 | 54.94 | 3.582 | 0.872 |

TABLE II: Ablation study. The first five metric columns are on ComplexMotion, the last five on iPER (SSIM↑, PSNR↑, LPIPS↓ for Image Reconstruction; FID↓, IS↑ for Motion Imitation).

| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | IS↑ |
| Our method, Complete | 0.896 | 27.52 | 0.032 | 46.62 | 3.728 | 0.868 | 26.72 | 0.049 | 54.94 | 3.582 |
Ablation study on Dense Skip Connections
| r/m dense skip connections | 0.868 | 25.28 | 0.050 | 62.39 | 3.218 | 0.813 | 24.21 | 0.075 | 62.12 | 3.271 |
Ablation study on Self-Supervised Face Enhancement
| Image feature | 0.703 | 22.42 | 0.129 | 83.44 | 3.215 | 0.634 | 22.14 | 0.108 | 99.34 | 3.019 |
| 2 face vectors | 0.728 | 24.68 | 0.129 | 78.20 | 3.108 | 0.719 | 21.34 | 0.114 | 89.51 | 3.167 |
| 3 face vectors | 0.758 | 25.62 | 0.079 | 56.80 | 3.331 | 0.807 | 22.56 | 0.089 | 73.33 | 3.267 |
| 4 face vectors | 0.784 | 26.08 | 0.088 | 59.71 | 3.304 | 0.753 | 21.52 | 0.098 | 75.28 | 3.261 |
| 5 face vectors | 0.883 | 27.15 | 0.040 | 48.03 | 3.543 | 0.856 | 25.86 | 0.068 | 56.27 | 3.461 |
| 1 candidate face | 0.732 | 24.88 | 0.139 | 75.20 | 3.158 | 0.689 | 20.24 | 0.104 | 88.51 | 3.067 |
| 2 candidate faces | 0.793 | 26.22 | 0.083 | 60.91 | 3.371 | 0.746 | 22.72 | 0.088 | 75.54 | 3.321 |
Ablation study on Multiple Local GANs
| r/m multi-local GAN | 0.872 | 26.19 | 0.053 | 61.49 | 3.320 | 0.848 | 24.19 | 0.078 | 63.22 | 3.373 |

4.2 Comparison with State-of-the-art Methods on Multiple Metrics (RQ1)

In this section, we compare our proposed approach with existing state-of-the-art approaches, which include:

  • EDN (Everybody dance now) [6]: A well-known pose-guided method for human motion copy, which makes amateurs dance like ballerinas.

  • C2F-FWN (Coarse-to-fine flow warping network) [10]: A novel motion copy method, which warps the layout based on the transformation flow.

  • FakeMotion [55]: A motion copy approach, which generates human appearance with optimal transport theory and polishes the local body parts with multiple local GANs.

  • FSV2V (Few-shot video2video) [7]: A high-resolution and few-shot video generation method which is applicable to motion copy, facial expression transformation, etc.

  • LWGAN (Liquid warping GAN) [9]: A unified warping framework which implements human motion copy, appearance (clothes) transfer, and novel view generation.

  • PoseWarp [13]: A motion copy method for sport scenes. In the method, 3D poses rather than 2D poses are utilized as the motion intermediary, which provide the spatial characteristics of a motion.

To quantitatively compare our method with existing approaches, we divide the applications into two scenarios: Image Reconstruction and Motion Imitation. For Image Reconstruction, we perform self-mimicry experiments in which persons imitate actions from themselves. In other words, we feed the pose skeleton of a subject into the network and output the human image of the same subject. We adopt Structural Similarity (SSIM) [59] as a low-level metric, and Peak Signal-to-Noise Ratio (PSNR) and Learned Perceptual Image Patch Similarity (LPIPS) [60] as perceptual-level metrics to evaluate the quality of the generated image sequence. For Motion Imitation, we perform cross-mimicry, where persons imitate the movements of others. Put differently, we input the pose skeleton of a subject into the network and output the human image of another subject. We utilize the Inception Score (IS) [61] and Fréchet Inception Distance (FID) [62] to examine the differences between the generated images and the ground-truth images. In addition, following [10], we employ the Temporal Consistency Metric (TCM) [63] to measure the temporal continuity of the generated video.

The experimental results on the ComplexMotion and iPER datasets are summarized in Table I. From the table, we have the following observations. (1) Among the existing methods, C2F-FWN [10] and FakeMotion [55] achieve the current state-of-the-art performance on both datasets. (2) FakeVideo outperforms state-of-the-art approaches in both Image Reconstruction and Motion Imitation. For example, in image reconstruction and motion imitation, FakeVideo gains 7.2% and 12.4% improvements on the PSNR and FID metrics, respectively. The significant performance improvements suggest the potential of FakeVideo for motion copy.

The experimental results on the SoloDance dataset are summarized in Table III. From the table, we have the following observations. (1) Among the existing methods, C2F-FWN [10] achieves the current state-of-the-art performance on SoloDance. (2) FakeVideo again outperforms state-of-the-art approaches, gaining 4.4% and 4% improvements on the PSNR and FID metrics, respectively.

TABLE III: Comparison with state-of-the-art methods on the SoloDance dataset.

| Methods | SSIM↑ | PSNR↑ | LPIPS↓ | FID↓ | TCM↑ |
| EDN [6] | 0.811 | 23.22 | 0.051 | 53.17 | 0.347 |
| FSV2V [7] | 0.721 | 20.84 | 0.132 | 112.99 | 0.106 |
| PoseWarp [13] | 0.692 | 19.80 | 0.147 | 120.13 | 0.102 |
| LWGAN [9] | 0.786 | 20.87 | 0.106 | 86.53 | 0.176 |
| C2F-FWN [10] | 0.879 | 26.65 | 0.049 | 46.49 | 0.641 |
| FakeVideo (Ours) | 0.893 | 27.82 | 0.038 | 44.72 | 0.739 |
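For reference, PSNR and SSIM on individual frames can be computed with scikit-image as sketched below; LPIPS, FID, IS, and TCM are usually obtained from dedicated packages and are omitted here. The `channel_axis` argument assumes scikit-image 0.19 or newer (older versions use `multichannel=True`).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def frame_metrics(generated: np.ndarray, target: np.ndarray) -> dict:
    """PSNR and SSIM for a single pair of uint8 RGB frames of identical size."""
    return {
        "psnr": peak_signal_noise_ratio(target, generated, data_range=255),
        "ssim": structural_similarity(target, generated, channel_axis=-1, data_range=255),
    }
```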

4.3 Visual Comparison with State-of-the-art Methods (RQ2)

Further, we visualize the generated results of state-of-the-art approaches on three datasets, as depicted in Fig. 6. Empirically, PoseWarp [13] and FSV2V [7] may produce distorted body shapes and missing limbs. We conjecture that this is because they do not fuse information across multiple scales, leading to inevitable information loss in the generation process. EDN [6] achieves realistic visual results; however, its generated human faces usually have blurred facial parts. LWGAN [9] and C2F-FWN [10] can effectively copy motions according to the optical flow; however, they have difficulties in generating fine-grained clothes and hair. In contrast, our method yields a more realistic human body and plausible local details. More visual results of our method are demonstrated in Fig. 7. We see that our method consistently generates realistic frames.

As shown in Fig. 6, the first column (Target Person) illustrates the target person, the second column (Source Person) shows the source person, the third column presents the desired poses, which are obtained from videos of the source person (not the target person), and the remaining columns show the generated frames of the target person. As shown in Fig. 6 and Fig. 7, we would like to clarify that the source person and the target person are not the same individual; they differ in face, body shape, clothes, and even gender. Fig. 6 and Fig. 7 contain multiple source-target person pairs from three datasets.

TABLE IV: Ablation on loss functions. The first three metric columns are on ComplexMotion, the last three on iPER.

| Losses | SSIM↑ | PSNR↑ | LPIPS↓ | SSIM↑ | PSNR↑ | LPIPS↓ |
| $L_p$ | 0.838 | 25.10 | 0.058 | 0.820 | 24.11 | 0.081 |
| $L_{GW}$ | 0.883 | 27.15 | 0.040 | 0.856 | 25.86 | 0.068 |
| $L_p + L_{GW}$ | 0.896 | 27.52 | 0.032 | 0.868 | 26.72 | 0.049 |

TABLE V: Ablation on the episodic memory module. The first three metric columns are on ComplexMotion, the last three on iPER.

| Variant | SSIM↑ | PSNR↑ | LPIPS↓ | SSIM↑ | PSNR↑ | LPIPS↓ |
| w/o memory | 0.892 | 27.38 | 0.036 | 0.860 | 26.38 | 0.051 |
| with memory | 0.896 | 27.52 | 0.032 | 0.868 | 26.72 | 0.049 |

4.4 Study on Key Components of FakeVideo (RQ3)

In this subsection, we study the effects of the different components of our method. With this goal in mind, we tried (1) removing the dense skip connections from the pose-to-appearance generation GAN, (2) utilizing different kinds of loss functions in the generation network, (3) removing the memory module from our framework, (4) using different face enhancement strategies, and (5) removing the multiple local GANs that are responsible for local enhancement.

Removing dense skip connections from the pose-to-appearance GAN. We first investigate the effect of removing the dense skip connections in the pose-to-appearance GAN. From Table II, we observe that the performance degrades significantly upon their removal. This is consistent with our intuition that the dense skip connections integrate multi-level latent features and give access to lower-level pose details, contributing to better performance.

Utilizing different kinds of loss functions. Then, we examine the effects of different kinds of loss functions. In the pose-to-appearance generation GAN, we employ a perceptual loss $\mathcal{L}_p$ and a Gromov-Wasserstein loss $\mathcal{L}_{GW}$. To study the impacts of the two loss functions, we conduct comparative experiments using $\mathcal{L}_p$, $\mathcal{L}_{GW}$, and the combined loss $\mathcal{L}_p + \mathcal{L}_{GW}$, respectively. The empirical results are elaborated in Table IV. The combined loss (i.e., $\mathcal{L}_p + \mathcal{L}_{GW}$) achieves the best performance, while using $\mathcal{L}_{GW}$ alone yields better results than using $\mathcal{L}_p$ alone.
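The exact Gromov-Wasserstein formulation used in the paper is not reproduced here. As an intuition aid, the sketch below shows a common simplified surrogate: with the coupling fixed to the identity (the i-th pose feature paired with the i-th appearance feature), the Gromov-Wasserstein objective reduces to matching the two intra-domain pairwise-distance matrices.

```python
import torch


def gw_surrogate_loss(pose_feats: torch.Tensor, app_feats: torch.Tensor) -> torch.Tensor:
    """Compare the pairwise-distance structure of two feature sets.

    pose_feats, app_feats: (N, D_pose) and (N, D_app) batches whose i-th rows
    correspond to the same sample. With the coupling fixed to the identity,
    penalising the discrepancy between the two intra-domain distance matrices
    is a simplified stand-in for the full Gromov-Wasserstein objective.
    """
    d_pose = torch.cdist(pose_feats, pose_feats)   # (N, N) distances in pose space
    d_app = torch.cdist(app_feats, app_feats)      # (N, N) distances in appearance space
    return ((d_pose - d_app) ** 2).mean()
```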

Removing the memory module from our framework. In order to evaluate the effectiveness of our memory module, we further try removing the episodic memory module from our framework. As shown in Table V, with the memory module, our approach achieves 0.14 and 0.24 higher PSNR scores than its counterpart without the memory module on the two datasets. This evidence shows that the memory module plays an important role in boosting the generation quality of our method.

Using different face enhancement strategies. For self-supervised face enhancement, we consider two schemes to select similar face images for the generated face: facial similarity computed with VGG image features, and facial similarity computed with the proposed face vector field. The results are demonstrated in Table II. Empirically, the face vector field strategy achieves significantly better performance, which is in accordance with our intuition. The image features of faces mainly capture appearance information (e.g., colors and eye shapes) while ignoring face orientation, making it difficult to effectively select similar face images to enhance the generated face. In contrast, our face vector field strategy accurately represents fine-grained face orientation details, which is conducive to selecting similar face images that are more valuable for compensating the facial details of the generated face.

We also examine the influence of the number of face vectors, as shown in Table II. We observe that the quality of the generated image gradually increases with the number of face vectors. We also ablate the number of similar face images selected, as shown in Table II. We find that three images might provide sufficient facial information, resulting in informed self-supervised face enhancement.
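One plausible way to retrieve similar-orientation faces with the face vectors described above is a simple cosine-similarity search, sketched below. The flattened-vector representation and the choice of k = 3 follow the ablation; everything else (shapes, normalisation) is an illustrative assumption.

```python
import numpy as np


def select_similar_faces(query_vecs: np.ndarray,
                         candidate_vecs: np.ndarray,
                         k: int = 3) -> np.ndarray:
    """Return indices of the k candidate faces whose orientation vectors are
    closest (by cosine similarity) to those of the generated face.

    query_vecs: (V, 2) face vectors of the generated face, V vectors per face.
    candidate_vecs: (N, V, 2) face vectors of N real frames of the target person.
    """
    q = query_vecs.reshape(-1)
    q = q / (np.linalg.norm(q) + 1e-8)
    c = candidate_vecs.reshape(candidate_vecs.shape[0], -1)
    c = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                      # cosine similarity to every candidate face
    return np.argsort(-sims)[:k]      # indices of the k most similar faces
```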

[Figure 8]

Removing local GANs from the local enhancement module. Finally, we remove the multiple local GANs to examine their contributions. As shown in Table II, FID significantly increases from 48.03 to 61.49 upon the removal of the local GANs. This dramatic image quality degradation highlights the effectiveness of the local GANs in local refinement. In particular, the residual images of human body parts generated by the local GANs capture the differences in color and texture details between the generated body-part image and the ground truth, facilitating the generation of more lifelike local images.
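A minimal sketch of residual-based local refinement is given below: each local GAN is assumed to output a residual image for its body-part crop, which is added back onto the coarse frame. The cropping interface and tensor shapes are illustrative, not the authors' implementation.

```python
import torch


def refine_part(coarse_frame: torch.Tensor, box: tuple, local_gan) -> torch.Tensor:
    """Refine one body-part crop with its dedicated local GAN.

    `box` is (y0, y1, x0, x1) in pixel coordinates; `local_gan` is assumed to
    output a residual image of the same size as its input crop.
    """
    y0, y1, x0, x1 = box
    crop = coarse_frame[:, :, y0:y1, x0:x1]           # (B, C, h, w) part crop
    residual = local_gan(crop)                        # predicted colour/texture residual
    refined = coarse_frame.clone()
    refined[:, :, y0:y1, x0:x1] = crop + residual     # add the residual back in place
    return refined
```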

4.5 Motion Copy on Other Articulated Objects (RQ4)

In addition to performing motion copy on humans, we are curious whether our approach can be generalized to other articulated objects, including zebrafish and mice. To this end, we further conduct experiments on the Fish [56] and Mouse [57] datasets. Interestingly, our method can be adapted to copy the motions of fish and mice. Empirical results are demonstrated in Fig. 8. Take fish as an example: in the training stage, we first employ Lie-X [56] to detect the desired poses of the fish from the given videos. Then, we disentangle the frames of the video of the target fish into foreground and background using Mask-RCNN [47]. Thereafter, we feed the desired poses into our pose-to-appearance generation network, where the network architecture remains the same but the feature size is adapted to fit fish. We would like to point out that we do not enhance the details of the generated frames, since a zebrafish is small in body size and the pose-to-appearance generation seems to be sufficient to yield realistic fish frames. Finally, we couple the generated foreground with the background, obtaining the entire fish video. In the inference stage, the network is fed with the desired poses from another fish, and we can synthesize a lively video of the target fish in which it swims and acts like the other fish. Experiments on fish and mice show that our method is able to copy the motions of other articulated objects.

4.6 Discussion about the Computational Time Comparison

Motion transfer models can be classified into two categories: dedicated-purpose models and general-purpose models. Specifically, dedicated-purpose models excel at generating fake videos of a specific person and offer high video quality at the expense of longer training time. In contrast, general-purpose models can generate fake videos of any person, requiring less training time but yielding less satisfactory results than dedicated-purpose models. In this paper, we concentrate on dedicated-purpose models. We would like to emphasize that despite the longer training time required by our dedicated-purpose model, it offers a shorter inference time. Empirical results are presented in Table VI. Specifically, during the inference phase, EDN [6] achieves an average of 14.29 frames per second (FPS), FakeMotion [55] achieves an average FPS of 15, and our method achieves an average FPS of 25.25.

TABLE VI: Inference speed comparison.

| Methods | EDN | FakeMotion | FakeVideo (Ours) |
| FPS | 14.29 | 15 | 25.25 |
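Inference FPS figures such as those in Table VI can be measured as sketched below; the warm-up iterations and CUDA synchronisation are standard benchmarking practice rather than details taken from the paper.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, pose_frames, warmup: int = 10) -> float:
    """Average frames-per-second of `model` over a list of pose inputs."""
    for p in pose_frames[:warmup]:        # warm-up iterations are not timed
        model(p)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for p in pose_frames[warmup:]:
        model(p)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (len(pose_frames) - warmup) / (time.time() - start)
```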

4.7 User Study

We conduct a user study with a cohort of 25 volunteers to evaluate the quality of the generated results. Each participant is presented with six clusters of generated results and asked to designate the best result within each group. Ultimately, we collect a total of 25 responses; the results are shown in Fig. 9. We can see that the proposed FakeVideo obtains the highest rating and significantly outperforms the other methods (EDN [6], PoseWarp [13], FSV2V [7], C2F-FWN [10], and LWGAN [38]).

[Figure 9]

4.8 Failure Case Analysis

[Figure 10]

Interestingly, we also observed a few failure cases, which are shown in Fig. 10. The first failure case is shown in Fig. 10 (a): when the hands of the source person are undetectable, the generated hands are not realistic enough. The second failure case is shown in Fig. 10 (b): if there are artifacts in the input source image, such as elongated arms and lower legs, the result is a target image with missing forearms and misaligned lower legs. In subsequent work, we plan to implement the following measures to further improve the generation:

(1) We will develop a more accurate human pose estimation framework, which plays an important role in the task of human motion copy.

(2) Additionally, we will enhance the network structure. Specifically, in cases where a generated limb of the synthesized frame is missing, the network will strive to generate a limb that aligns with the target person, thereby ensuring a more coherent output.

5 Conclusion

In this work, we present FakeVideo, a novel approach for motion copy. The crucial ingredients are a pose-to-appearance generation network with Gromov-Wasserstein and perceptual losses, and a memory module that consistently learns from its past poor generations. We further introduce a self-supervised face enhancement module that resorts to face frames with similar orientations to polish the facial details of the generated face. Interestingly, our approach can be generalized to other articulated objects, including fish and mice. Extensive empirical results on five datasets, iPER, ComplexMotion, SoloDance, Fish, and Mouse, demonstrate the efficacy of the proposed method.

References

  • [1] V. Visch, E. Tan, and D. P. Saakes, "Viewer knowledge: Application of exposure-based layperson knowledge in genre-specific animation production," International Journal of Design, vol. 9, no. 1, pp. 83–89, 2015.
  • [2] T.-Y. Mou, "Creative story design method in animation production pipeline," in DS79: Proceedings of The Third International Conference on Design Creativity, Indian Institute of Science, Bangalore, 2015.
  • [3] J. Carmigniani, B. Furht, M. Anisetti, P. Ceravolo, E. Damiani, and M. Ivkovic, "Augmented reality technologies, systems and applications," Multimedia Tools and Applications, vol. 51, no. 1, pp. 341–377, 2011.
  • [4] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, "A survey on mobile augmented reality with 5G mobile edge computing: architectures, applications, and technical aspects," IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1160–1192, 2021.
  • [5] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1526–1535.
  • [6] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5933–5942.
  • [7] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, "Few-shot video-to-video synthesis," arXiv preprint arXiv:1910.12713, 2019.
  • [8] M. Ghafoor, K. Javed, and A. Mahmood, "Walk like me: Video to video action transfer," IEEE Transactions on Multimedia, p. 1, 2022.
  • [9] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao, "Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5904–5913.
  • [10] D. Wei, X. Xu, H. Shen, and K. Huang, "C2F-FWN: Coarse-to-fine flow warping network for spatial-temporal consistent motion transfer," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2852–2860.
  • [11] D. Joo, D. Kim, and J. Kim, "Generating a fusion image: One's identity and another's shape," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1635–1643.
  • [12] Z. Huang, X. Han, J. Xu, and T. Zhang, "Few-shot human motion transfer by personalized geometry and texture modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2297–2306.
  • [13] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, "Synthesizing images of humans in unseen poses," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8340–8348.
  • [14] Z. Liu, K. Lyu, S. Wu, H. Chen, Y. Hao, and S. Ji, "Aggregated multi-GANs for controlled 3D human motion prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2225–2232.
  • [15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
  • [16] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in International Conference on Machine Learning. PMLR, 2016, pp. 1747–1756.
  • [17] H. Zheng, J. Chen, H. Du, W. Zhu, S. Ji, and X. Zhang, "Grip-GAN: An attack-free defense through general robust inverse perturbation," IEEE Transactions on Dependable and Secure Computing, 2021.
  • [18] J. Zhang, K. Li, Y.-K. Lai, and J. Yang, "PISE: Person image synthesis and editing with decoupled GAN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7982–7990.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
  • [21] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
  • [22] L. Gao, D. Chen, Z. Zhao, J. Shao, and H. T. Shen, "Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis," Pattern Recognition, vol. 110, p. 107384, 2021.
  • [23] J. Cao, Y. Hu, B. Yu, R. He, and Z. Sun, "3D aided duet GANs for multi-view face image synthesis," IEEE Transactions on Information Forensics and Security, vol. 14, no. 8, pp. 2028–2042, 2019.
  • [24] Y. He, J. Zhang, H. Shan, and L. Wang, "Multi-task GANs for view-specific feature learning in gait recognition," IEEE Transactions on Information Forensics and Security, vol. 14, no. 1, pp. 102–113, 2018.
  • [25] J. Yang, D. Ruan, J. Huang, X. Kang, and Y.-Q. Shi, "An embedding cost learning framework using GAN," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 839–851, 2019.
  • [26] S. Shehnepoor, R. Togneri, W. Liu, and M. Bennamoun, "ScoreGAN: A fraud review detector based on regulated GAN with data augmentation," IEEE Transactions on Information Forensics and Security, vol. 17, pp. 280–291, 2021.
  • [27] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," arXiv preprint arXiv:1701.04862, 2017.
  • [28] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [29] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [30] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, "Unsupervised person image synthesis in arbitrary poses," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8620–8628.
  • [31] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, "Video-to-video synthesis," arXiv preprint arXiv:1808.06601, 2018.
  • [32] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised video retargeting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–135.
  • [33] P. Esser, E. Sutter, and B. Ommer, "A variational U-Net for conditional appearance and shape generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8857–8866.
  • [34] J. Ren, M. Chai, S. Tulyakov, C. Fang, X. Shen, and J. Yang, "Human motion transfer from poses in the wild," in European Conference on Computer Vision. Springer, 2020, pp. 262–279.
  • [35] C. Xu, Y. Fu, C. Wen, Y. Pan, Y.-G. Jiang, and X. Xue, "Pose-guided person image synthesis in the non-iconic views," IEEE Transactions on Image Processing, vol. 29, pp. 9060–9072, 2020.
  • [36] Z. Yang, W. Zhu, W. Wu, C. Qian, Q. Zhou, B. Zhou, and C. C. Loy, "TransMoMo: Invariance-driven unsupervised video motion retargeting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5306–5315.
  • [37] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin, "Soft-gated warping-GAN for pose-guided person image synthesis," Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [38] W. Liu, Z. Piao, Z. Tu, W. Luo, L. Ma, and S. Gao, "Liquid warping GAN with attention: A unified framework for human image synthesis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [39] A. Shysheya, E. Zakharov, K.-A. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov et al., "Textured neural avatars," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2387–2397.
  • [40] L. Liu, W. Xu, M. Habermann, M. Zollhöfer, F. Bernard, H. Kim, W. Wang, and C. Theobalt, "Neural human video rendering by learning dynamic textures and rendering-to-video translation," arXiv preprint arXiv:2001.04947, 2020.
  • [41] X. Han, X. Hu, W. Huang, and M. R. Scott, "ClothFlow: A flow-based model for clothed person generation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10471–10480.
  • [42] S. Liu, Y. Li, and G. Hua, "Human pose estimation in video via structured space learning and halfway temporal evaluation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2029–2038, 2018.
  • [43] M. Ghafoor and A. Mahmood, "Quantification of occlusion handling capability of 3D human pose estimation framework," IEEE Transactions on Multimedia, 2022.
  • [44] S. Aftab, S. F. Ali, A. Mahmood, and U. Suleman, "A boosting framework for human posture recognition using spatio-temporal features along with radon transform," Multimedia Tools and Applications, vol. 81, no. 29, pp. 42325–42351, 2022.
  • [45] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
  • [46] Z. Liu, H. Chen, R. Feng, S. Wu, S. Ji, B. Yang, and X. Wang, "Deep dual consecutive network for human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 525–534.
  • [47] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
  • [48] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Free-form image inpainting with gated convolution," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4471–4480.
  • [49] Y. Cheng, C. Xu, Z. Hai, and Y. Li, "DeepMnemonic: Password mnemonic generation via deep attentive encoder-decoder model," IEEE Transactions on Dependable and Secure Computing, 2020.
  • [50] Y. Wang, D. J. Tan, N. Navab, and F. Tombari, "SoftPool++: An encoder–decoder network for point cloud completion," International Journal of Computer Vision, vol. 130, no. 5, pp. 1145–1164, 2022.
  • [51] G. Peyré, M. Cuturi, and J. Solomon, "Gromov-Wasserstein averaging of kernel and distance matrices," in International Conference on Machine Learning. PMLR, 2016, pp. 2664–2672.
  • [52] S. Wu, Z. Liu, S. Lu, and L. Cheng, "Dual learning music composition and dance choreography," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3746–3754.
  • [53] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, and G. Wang, "Low dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss," 2017.
  • [54] C. de Masson D'Autume, S. Ruder, L. Kong, and D. Yogatama, "Episodic memory in lifelong language learning," Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [55] Z. Liu, S. Wu, C. Xu, X. Wang, L. Zhu, S. Wu, and F. Feng, "Copy motion from one to another: Fake motion video generation," arXiv preprint arXiv:2205.01373, 2022.
  • [56] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng, "Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups," International Journal of Computer Vision, vol. 123, no. 3, pp. 454–478, 2017.
  • [57] Z. Liu, S. Wu, S. Jin, Q. Liu, S. Ji, S. Lu, and L. Cheng, "Investigating pose representations and motion contexts modeling for 3D motion prediction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [58] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  • [59] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [60] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
  • [61] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," Advances in Neural Information Processing Systems, vol. 29, 2016.
  • [62] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [63] C.-H. Yao, C.-Y. Chang, and S.-Y. Chien, "Occlusion-aware video temporal consistency," in Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 777–785.
Sifan Wu is currently pursuing a Ph.D. degree at Jilin University. He received his B.E. and M.E. degrees from Hebei GEO University and Zhejiang Gongshang University, respectively. His research interests include computer vision, motion copy, and pose estimation.
Zhenguang Liu is currently a research professor at Zhejiang University. He was a research fellow at the National University of Singapore and A*STAR (Agency for Science, Technology and Research, Singapore). He received his Ph.D. and B.E. degrees from Zhejiang University and Shandong University, China, respectively. His research interests include multimedia data analysis and smart contract security. Various parts of his work have been published in first-tier venues including PAMI, ACM CCS, CVPR, ICCV, TKDE, TIP, WWW, TDSC, AAAI, ACM MM, INFOCOM, IJCAI, etc. Dr. Liu has served as a technical program committee member for top-tier conferences such as CVPR, ICCV, WWW, AAAI, IJCAI, and ACM MM, session chair of ICGIP, local chair of KSEM, and reviewer for IEEE PAMI, IEEE TVCG, IEEE TPDS, IEEE TIP, ACM TOMM, IEEE MM, etc.
Beibei Zhang received his B.Sc. (Hons) degree in Information Technology from The Hong Kong Polytechnic University in 2017 and his M.Eng. degree from the Department of Electrical & Computer Engineering, University of Toronto, in 2020. He is currently a Research Engineer at Zhejiang Lab, Hangzhou, China. His research interests include distributed systems, cloud computing, and peer-to-peer networks.
Zhongjie Ba received the Ph.D. degree in Computer Science and Engineering from the State University of New York at Buffalo, USA, in 2019. He is currently a Professor with the School of Cyber Science and Technology, College of Computer Science and Technology, Zhejiang University, China. He was a Postdoctoral Researcher in the School of Computer Science at McGill University, Canada. His current research interests include the security and privacy aspects of the Internet of Things, forensic analysis of multimedia content, and privacy-enhancing technologies in the context of collaborative deep learning. Results have been published in peer-reviewed top conferences and journals, including CCS, NDSS, INFOCOM, ICDCS, and IEEE Trans. Inf. Forensics Security. Currently, Zhongjie Ba serves as an Associate Editor of the IEEE Internet of Things Journal and on the technical program committees of several conferences in the field of Internet of Things and wireless communication.
Roger Zimmermann (M'93–SM'07) received the M.S. and Ph.D. degrees from the University of Southern California, Los Angeles, USA, in 1994 and 1998, respectively. He is currently an Associate Professor with the Department of Computer Science, National University of Singapore (NUS), Singapore, where he is also the Deputy Director of the Smart Systems Institute, and co-directed the Centre of Social Media Innovations for Communities. He has co-authored a book, seven patents, and over 200 conference publications, journal articles, and book chapters. His research interests include streaming media architectures, distributed systems, mobile and geo-referenced video management, collaborative environments, spatio-temporal information management, and mobile location-based services. He is a Senior Member of the IEEE and a Distinguished Member of the ACM.
Xiaosong Zhang received the B.S. degree in dynamics engineering from Shanghai Jiao Tong University, Shanghai, China, in 1990, and the M.S. and Ph.D. degrees in computer science from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2011. He is currently a Professor with the School of Computer Science and Engineering, UESTC. He has worked on numerous projects in both research and development roles, including device security, intrusion detection, malware analysis, software testing, and software verification. He has coauthored a number of research articles on computer security. His current research interests include software reliability, software vulnerability discovery, software test case generation, and reverse engineering.
Kui Ren (Fellow, IEEE; Fellow, ACM) received the Ph.D. degree from Worcester Polytechnic Institute. He is currently a professor and an associate dean of the College of Computer Science and Technology, Zhejiang University, where he also directs the Institute of Cyber Science and Technology. Before that, he was the SUNY Empire Innovation Professor at the State University of New York at Buffalo. He has authored or coauthored extensively in peer-reviewed journals and conferences. His research interests include data security, IoT security, AI security, and privacy. His h-index is 74, and his total publication citations exceed 32,000 according to Google Scholar. He was the recipient of the Guohua Distinguished Scholar Award from ZJU in 2020, the IEEE CISTC Technical Recognition Award in 2017, the SUNY Chancellor's Research Excellence Award in 2017, the Sigma Xi Research Excellence Award in 2012, the NSF CAREER Award in 2011, a Test-of-Time Paper Award from IEEE INFOCOM, and many best paper awards from the IEEE and ACM, including MobiSys'20, ICDCS'20, Globecom'19, ASIACCS'18, and ICDCS'17. He is a distinguished member of the ACM and a Clarivate highly-cited researcher. He is a frequent reviewer for funding agencies internationally and was on the editorial boards of many IEEE and ACM journals. He is the chair of the SIGSAC of ACM China.