Check out Decaf, a novel monocular motion capture method for face and hand interactions.
A team of researchers from the Max Planck Institute for Informatics and Valeo.ai has introduced Decaf, a new method for producing highly realistic 3D hand and face reconstructions. The team describes it as the first monocular motion capture method that, from a single video, regresses 3D hand and face motions along with the deformations arising from their interactions.
According to the research team, their approach represents hands as articulated objects that induce non-rigid facial deformations during active interactions. To support this method, they have built a dataset of hand-face motions and interactions, complete with realistic facial deformations, all acquired with a markerless multi-view camera system.
"As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions, and head-hand positions," commented the researchers. "At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible compared to several baselines applicable in our setting, both quantitatively and qualitatively."
"Our Decaf approach captures hands and face motions as well as the face surface deformations arising from the interactions from a single-view RGB video," reads the research paper shared by the team. "Thanks to our new dataset with 3D surface deformations relying on position-based dynamics that considers the underlying human skull structure, our neural architecture estimates plausible hands-head interactions and head deformations. The examples in this figure highlight the variety of the supported hand poses and facial expressions. The results are temporally consistent."