Abstract


This white paper describes how we use the Datagen Platform to create a synthetic face dataset that:

  1. Achieves comparable results to other real and synthetic datasets on the task of landmark detection.
  2. Boosts model performance while reducing the amount of real data required.

In this paper, we take a data-centric approach: we iteratively improve performance by optimizing our data rather than our model.

We describe the domain gap that naturally arises when training on synthetic data and testing on real data. We encounter both a visual domain gap - the images differ visually - and a label domain gap. The label domain gap is caused by the differences between human 2D labeling and 2D labels derived from a 3D model (available only in synthetic data). Additionally, 2D human labeling is prone to more noise than the pixel-perfect annotations available with synthetic data. In this work, we describe the measures we took to mitigate these gaps.
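To make the label domain gap concrete: synthetic 2D landmarks are typically obtained by projecting the 3D face model's landmark points through the known virtual camera, which is why they are pixel-perfect. The sketch below illustrates this with a basic pinhole camera model; the function name, intrinsics, and point values are our own illustrative assumptions, not Datagen's actual parameters or pipeline.

```python
import numpy as np

def project_landmarks(points_3d, fx, fy, cx, cy):
    """Project 3D landmark points (N, 3), given in camera coordinates,
    to 2D pixel coordinates (N, 2) with a pinhole camera model
    (focal lengths fx, fy and principal point cx, cy; no lens distortion)."""
    pts = np.asarray(points_3d, dtype=float)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Illustrative values: a point on the optical axis projects to the principal point.
uv = project_landmarks([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0]], fx=100, fy=100, cx=64, cy=64)
```

Because the 3D landmark positions and the camera are known exactly in a synthetic scene, this projection introduces no annotation noise, unlike human labeling of real images.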

We compare two strategies for combining different amounts of real and synthetic data:

Methodology


Differences in how the data was obtained or labeled, the position of the camera, and the distribution of populations within the datasets are all examples of domain gaps one may need to bridge.

Visual domain gap

Our source and target domains (synthetic training data and real test data, respectively) come from different distributions. In tackling this issue, we learned that the preprocessing stage is a key factor with a strong influence on generalization to the target domain.

The preprocessing steps we experimented with included initial cropping around the face area and different augmentations.
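As an illustration of the initial cropping step, the sketch below expands a face bounding box by a relative margin, clips it to the image bounds, and shifts the landmark coordinates into the crop's frame. The function name, margin value, and sample inputs are hypothetical, chosen only to demonstrate the idea; they are not the parameters used in our pipeline.

```python
import numpy as np

def crop_face(image, bbox, landmarks, margin=0.2):
    """Crop `image` (H, W, C) around `bbox` = (x0, y0, x1, y1), enlarged by
    `margin` of the box size on each side, and translate `landmarks` (N, 2)
    into the coordinate frame of the crop."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    # Expand the box by the margin, then clip to the image bounds.
    x0 = max(int(x0 - mx), 0)
    y0 = max(int(y0 - my), 0)
    x1 = min(int(x1 + mx), w)
    y1 = min(int(y1 + my), h)
    crop = image[y0:y1, x0:x1]
    shifted = np.asarray(landmarks, dtype=float) - [x0, y0]
    return crop, shifted

# Illustrative usage: a 40x40 box in a 100x100 image, expanded by 20% per side.
img = np.zeros((100, 100, 3), dtype=np.uint8)
crop, lm = crop_face(img, bbox=(30, 30, 70, 70), landmarks=[[50, 50]], margin=0.2)
```

Applying the same crop logic to both synthetic training images and real test images helps keep the face framing consistent across domains, which is one way such preprocessing can narrow the visual domain gap.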