ORFormer: Occlusion-Robust Transformer
for Accurate Facial Landmark Detection
WACV 2025



Abstract

Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all patches except its own. This way, the consensus between a patch and the other patches can be assessed by referring to the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state-of-the-art methods on challenging datasets such as WFLW and COFW.
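The messenger-token idea can be illustrated with a toy sketch. Assuming simplified 2-D patch features and plain mean aggregation in place of masked attention (both are our assumptions for illustration, not the paper's architecture), each patch's own feature plays the role of its regular embedding, while the mean over all other patches stands in for its messenger embedding; low similarity between the two flags a patch that disagrees with the consensus:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def messenger_consensus(patch_feats):
    """For each patch, compare its own feature (regular embedding) with the
    aggregate of all *other* patches (messenger embedding). Mean aggregation
    is a stand-in for attention. Low similarity -> the patch disagrees with
    the consensus -> likely occluded or corrupted."""
    scores = []
    n = len(patch_feats)
    for i, feat in enumerate(patch_feats):
        others = [patch_feats[j] for j in range(n) if j != i]
        messenger = [sum(col) / (n - 1) for col in zip(*others)]
        scores.append(cosine(feat, messenger))
    return scores

# three consistent patches and one outlier (e.g., an occluded region)
feats = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [-1.0, 2.0]]
scores = messenger_consensus(feats)
assert scores.index(min(scores)) == 3  # the outlier gets the lowest similarity
```

In ORFormer itself this comparison is done with learned attention inside the transformer; the sketch only shows why excluding a patch from its own messenger aggregation exposes inconsistent regions.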


Overview


overview

(a) We first train a quantized heatmap generator, which takes an image I as input and generates its edge heatmaps H. After pre-training, the prior knowledge of unoccluded faces is encoded in the codebook C and decoder D. (b) With the frozen codebook and decoder, we introduce ORFormer to generate the occlusion map α and two code sequences SI and SM, leading to quantized features ZI and ZM. The recovered feature Zrec is yielded by merging ZI and ZM with patch-specific weights given in α, and is used to produce occlusion-robust heatmaps Hrec.
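The patch-specific merge can be sketched as a convex combination, which is our assumed form of the merge: where the occlusion likelihood α is high, the messenger-side feature ZM replaces the (possibly corrupted) image-side feature ZI:

```python
def recover_features(z_i, z_m, alpha):
    """Patch-wise blend of quantized features: where alpha (occlusion
    likelihood) is high, trust the messenger-aggregated feature z_m;
    where it is low, keep the image feature z_i. Inputs are lists of
    per-patch feature vectors and a per-patch alpha scalar."""
    recovered = []
    for zi, zm, a in zip(z_i, z_m, alpha):
        recovered.append([a * m + (1.0 - a) * i for i, m in zip(zi, zm)])
    return recovered

z_i = [[1.0, 2.0], [0.0, 0.0]]   # second patch's image feature is corrupted
z_m = [[1.0, 2.0], [3.0, 4.0]]   # messenger features built from visible patches
alpha = [0.0, 1.0]               # second patch flagged as fully occluded
z_rec = recover_features(z_i, z_m, alpha)
print(z_rec)  # [[1.0, 2.0], [3.0, 4.0]]
```

The recovered features then pass through the frozen decoder D to produce the occlusion-robust heatmaps Hrec.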


Occlusion-Robust Transformer (ORFormer)


ORFormer

ORFormer takes image patches P as input and generates two code sequences SI and SM via the codebook prediction head. While SI is computed from the image patch tokens, SM is computed from the messenger tokens. The occlusion map α represents the patch-specific occlusion likelihood and is inferred by the occlusion detection head.
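Predicting a code sequence against a frozen codebook amounts to assigning each token embedding to a codebook entry. A minimal sketch using nearest-neighbour lookup (the standard VQ formulation; the paper's prediction head may instead classify over code indices) looks like:

```python
def quantize(embeddings, codebook):
    """Nearest-neighbour codebook lookup (squared L2): each token embedding
    is mapped to the index of its closest code, yielding a code sequence."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(e, codebook[k]))
            for e in embeddings]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # frozen codebook C (toy)
patch_embs = [[0.1, -0.1], [0.9, 1.2]]       # from image patch tokens -> S_I
messenger_embs = [[1.9, 0.1], [0.0, 0.2]]    # from messenger tokens  -> S_M
s_i = quantize(patch_embs, codebook)
s_m = quantize(messenger_embs, codebook)
print(s_i, s_m)  # [0, 1] [2, 0]
```

Looking up the selected codes in C gives the quantized features ZI and ZM used in the recovery step.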


Integration with FLD


integration

ORFormer is adopted for occlusion detection and feature recovery, resulting in high-quality heatmaps. The generated heatmaps serve as an extra input to an FLD method, offering the recovered features to make the FLD method robust to occlusions.
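One simple way to feed the heatmaps to an FLD backbone, assumed here for illustration, is to concatenate them with the image along the channel axis:

```python
def stack_inputs(image, heatmaps):
    """Concatenate recovered edge heatmaps with the image along the channel
    axis (channels-first nested lists here), so the FLD backbone sees both.
    The concatenation scheme is an assumption, not the paper's exact design."""
    return image + heatmaps

image = [[[0.5]], [[0.3]], [[0.1]]]   # 3 RGB channels of a 1x1 toy image
heatmaps = [[[0.9]], [[0.0]]]         # 2 recovered edge-heatmap channels
x = stack_inputs(image, heatmaps)
assert len(x) == 5  # 3 + 2 channels fed to the FLD network
```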


Quantitative Results


quantitative

Quantitative comparison with state-of-the-art methods on WFLW, COFW, and 300W. NME is reported for all datasets. For WFLW, FR and AUC with a threshold of 10% are included. The best and second best results are highlighted. The † symbol represents the results we reproduced.


Qualitative Results


qualitative

Qualitative comparison with the reproduced baseline method, STAR, on extreme cases from the test set of WFLW. The ground-truth landmarks are marked in blue, while the predicted landmarks are in red. The green lines represent the distance between the ground-truth landmarks and the predicted landmarks. Orange ellipses highlight variations between the methods in the challenging areas.


Alpha Map Visualization


alpha

Visualization of the α maps yielded by ORFormer. Red regions indicate higher values of α, suggesting heavier feature occlusion or corruption detected by ORFormer.


Citation


Acknowledgements

This work was supported in part by the National Science and Technology Council (NSTC) under grants 112-2221-E-A49-090-MY3, 111-2628-E-A49-025-MY3, 112-2634-F-006-002, and 113-2640-E-006-006. This work was funded in part by MediaTek.

The website template was borrowed from CEVR.