Abstract
Human pose and shape (HPS) estimation methods have
been extensively studied, with many demonstrating high
zero-shot performance on in-the-wild images and videos.
However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions: most existing datasets simulate occlusion with random patches or clipart-style overlays placed over the person, which do not reflect real-world challenges. To bridge
this gap in realistic occlusion datasets, we introduce a
novel benchmark dataset, VOccl3D, a Video-based human
Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we
constructed this dataset using advanced computer graphics
rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions.
Additionally, we fine-tuned two recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements on multiple public datasets as well as on the test split of our dataset, and compared their performance with other state-of-the-art methods. Furthermore, we leveraged our dataset
to enhance human detection performance under occlusion
by fine-tuning an existing object detector, YOLO11, yielding a robust end-to-end HPS estimation system under occlusion. Overall, this dataset serves as a valuable
resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic
alternative to existing occlusion datasets.