dc.description.abstract | Accurately estimating the poses of multiple individuals in unconstrained scenes would improve many vision-based applications, such as person re-identification, human-computer interaction, behavioral analysis and scene understanding. Thanks to advances in convolutional network research, body part detectors are now accurate and can estimate spatial positioning on still images in real time (30 FPS), in both single- and multi-person scenarios. Multiple individuals interacting in videos, in turn, impose additional challenges, such as person-to-person occlusion, truncated body parts, additional assignment steps and more sources of double counting. In the last few years, many advances have contributed towards this goal and partially solved some of these challenges. Nonetheless, long-term person-to-person occlusion cannot be handled in still images, because they lack the discriminative features needed to detect the occluded individual. Most reviewed works solve this problem by collecting motion features that correlate body parts across multiple video frames, exploiting temporal dependency. Usually, these approaches either rely only on adjacent frames to stay close to real time, or process the whole video beforehand, imposing global consistency in an offline manner. Since most of the cited applications rely on near real-time processing in combination with complex human motions, which are not captured in just a couple of frames, we propose the PastLens model. Our main objective is to provide a cost-efficient alternative to the tradeoff between the number of correlated frames and the estimation time. The model imposes spatio-temporal constraints on the convolutional network itself, instead of relying on arbitrarily designed temporal features.
We stretch the receptive field of the mid layers to also include the previous frame, forcing later layers to detect features that correlate poses across the two frames, without losing the per-frame configuration. Moreover, we do not constrain the representation of such features, allowing it to be learned during training, alongside the pose estimation. By pose estimation and tracking, we refer to the localization and tracking over time of the head, limbs and torso, followed by the assembly of these body parts into poses that correctly encode the scene. We do not evaluate our approach on benchmarks for facial keypoints or gesture recognition. PoseTrack is the dataset of choice for both training and validation, since it provides a publicly available benchmark for estimating and tracking poses, in addition to a leaderboard that enables direct comparison of our results with their state-of-the-art counterparts. Experimental results indicate that our model can reach competitive accuracy on multi-person videos, while using fewer operations and being easier to attach to pretrained networks. Regarding scientific contributions, we provide a cost-efficient alternative for imposing temporal consistency on the human pose estimation (HPE) pipeline through an increase in receptive field only, letting the representation of temporal features be learned from data. Hence, our results may lead towards novel ways of exploiting temporal consistency for human pose estimation in videos. | en |