MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement

What is MeshTalk?

A method for generating 3D facial animation from speech
- Existing approaches to speech-driven 3D animation were audio-driven facial animation models
  - As a result, they tend to produce uncanny, static-looking animation
  - They are also person-specific models
This paper uses a cross-modality loss to disentangle audio-correlated from audio-uncorrelated information, and uses a categorical latent space for facial animation
The results show natural-looking synthesized animation with highly accurate lip motion

Methodology

The goal is to animate an arbitrary neutral face mesh using speech alone


The encoder maps audio sequences and expression sequences into a multi-head categorical latent space
In each latent classification head, the continuous-valued encoding is converted into a categorical representation using Gumbel-softmax
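
As a rough illustration of the Gumbel-softmax step, the sketch below perturbs each head's logits with Gumbel noise, applies a temperature-scaled softmax, and then takes the arg-max to get a one-hot categorical code per head. The function names, the temperature default, and the number of heads and categories are assumptions made for this example, not values from the paper. It shows only forward sampling; the gradient handling needed for training is left out.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed categorical sample from the unnormalized logits of one head."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def to_one_hot(probs):
    """Hard categorical code: 1 at the arg-max category, 0 elsewhere."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


def encode_heads(head_logits, temperature=1.0, rng=random):
    """Apply Gumbel-softmax independently to every classification head."""
    return [to_one_hot(gumbel_softmax(logits, temperature, rng))
            for logits in head_logits]
```

Lower temperatures push the relaxed sample closer to a one-hot vector, while higher temperatures make it closer to uniform.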
The decoder maps the encoded expression onto the template mesh $h$
Ground truth is available only when the template mesh, speech signal, and expression signal all come from the same identity
The desired decoder output $\hat{h}_{1:T}$ is then equal to the expression input $x_{1:T}$
Using a simple reconstruction loss between $\hat{h}_{1:T}$ and $x_{1:T}$ leads to poor speech-to-lip synchronization
- All the information needed for a perfect reconstruction is already contained in the expression signal, so the model ends up ignoring the audio signal
The paper therefore uses a cross-modality loss to ensure that information from both the speech and expression modalities is used
Instead of a single reconstruction $\hat{h}_{1:T}$, two different reconstructions are used

$$ \hat{h}^{(audio)}_{1:T}=\mathcal{D}(h_x,\mathcal{E}(\tilde{x}_{1:T},a_{1:T})) \quad \text{and} \quad \hat{h}^{(expr)}_{1:T}=\mathcal{D}(h_x,\mathcal{E}(x_{1:T},\tilde{a}_{1:T})) $$
- $x_{1:T}$ and $a_{1:T}$ are the expression and speech sequences
- $h_x$ is the template mesh
- $\tilde{x}_{1:T}$ and $\tilde{a}_{1:T}$ are an expression sequence and an audio sequence sampled at random from the training set
- In short, $\hat{h}^{(audio)}_{1:T}$ is reconstructed from the correct audio paired with a random expression sequence, while $\hat{h}^{(expr)}_{1:T}$ is reconstructed from the correct expression sequence paired with random audio
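
The two reconstructions can be sketched as two forward passes that each swap one input for a randomly drawn one. In this minimal sketch, `encoder`, `decoder`, and `dataset` are hypothetical stand-ins (plain callables and a list of pairs), not the paper's actual networks or data loader, and the random inputs are drawn uniformly from the dataset as an assumption.

```python
import random


def cross_modal_reconstructions(encoder, decoder, template, expr, audio,
                                dataset, rng=random):
    """Return (h_audio, h_expr) for one training sequence.

    encoder(expr_seq, audio_seq) -> latent code    (hypothetical callable)
    decoder(template, code)      -> animated mesh  (hypothetical callable)
    dataset: list of (expr_seq, audio_seq) pairs used to draw random inputs
    """
    rand_expr, _ = rng.choice(dataset)   # x~: expression unrelated to `audio`
    _, rand_audio = rng.choice(dataset)  # a~: audio unrelated to `expr`

    # Correct audio, random expression: only the audio matches the target.
    h_audio = decoder(template, encoder(rand_expr, audio))
    # Correct expression, random audio: only the expression matches the target.
    h_expr = decoder(template, encoder(expr, rand_audio))
    return h_audio, h_expr
```

Because each pass sees only one correct modality, whatever each reconstruction gets right must have come from that modality.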
The novel cross-modality loss proposed in the paper is therefore as follows

$$ \begin{aligned} \mathcal{L}_{xMod} = {} & \sum_{t=1}^T\sum_{v=1}^V\mathcal{M}_v^{(upper)}\big(\|\hat{h}_{t,v}^{(expr)}-x_{t,v}\|^2\big) \\ & + \sum_{t=1}^T\sum_{v=1}^V\mathcal{M}_v^{(mouth)}\big(\|\hat{h}_{t,v}^{(audio)}-x_{t,v}\|^2\big) \end{aligned} $$
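- $\mathcal{M}^{(upper)}$ is a mask that assigns low weights to vertices around the mouth and high weights to vertices on the upper face
- $\mathcal{M}^{(mouth)}$ is a mask that assigns high weights to vertices around the mouth and low weights to all other vertices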
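
The loss can be written out directly as a double sum over frames and vertices. The sketch below is a plain-Python reading of the formula, assuming meshes are stored as nested sequences indexed `[t][v]` with one `(x, y, z)` tuple per vertex and masks are per-vertex weight lists. That data layout and the function name are choices made for illustration, not the paper's implementation.

```python
def cross_modality_loss(h_expr, h_audio, target, mask_upper, mask_mouth):
    """Masked squared-error loss summed over T frames and V vertices.

    h_expr, h_audio, target: sequences indexed [t][v] -> (x, y, z)
    mask_upper, mask_mouth:  per-vertex weights, each of length V
    """
    def sq_dist(p, q):
        # Squared Euclidean distance between two 3D points.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    loss = 0.0
    for t, frame in enumerate(target):
        for v, vertex in enumerate(frame):
            # Upper-face term: the expression-driven reconstruction is
            # penalized with the upper-face mask.
            loss += mask_upper[v] * sq_dist(h_expr[t][v], vertex)
            # Mouth term: the audio-driven reconstruction is penalized
            # with the mouth mask.
            loss += mask_mouth[v] * sq_dist(h_audio[t][v], vertex)
    return loss
```

With these masks, errors around the mouth are charged mostly to the audio-driven reconstruction and errors on the upper face mostly to the expression-driven one, which is what pushes lip motion to depend on the audio input.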