Figure 1: User study scheme. Participants wearing NeckFace watch videos where a scenario of human or robot error is shown, eliciting a reaction. The IR camera image is converted into 3D facial points and head rotation data through a customized NeckNet model. This data is then used to train error detection models which map human reactions to the scenario displayed.
How do humans recognize and rectify social missteps? We achieve social competence by looking around at our peers, decoding subtle cues from bystanders — a raised eyebrow, a laugh — to evaluate the environment and our actions. Robots, however, struggle to perceive and make use of these nuanced reactions.
By employing a novel neck-mounted device that records facial expressions from the chin region, we explore the potential of previously untapped data to capture and interpret human responses to robot error. First, we develop NeckNet-18, a 3D facial reconstruction model that maps the reactions captured by the chin camera onto 3D facial points and head motion. We then use these facial responses to train a robot error detection model that outperforms standard approaches such as OpenFace features or frontal video, and generalizes especially well on within-participant data.
Through this work, we argue for expanding human-in-the-loop robot sensing, fostering more seamless integration of robots into diverse human environments, pushing the boundaries of social cue detection and opening new avenues for adaptable and sustainable robotics.
- A lightweight 3D facial reconstruction model (NeckNet-18, ResNet-18 based) that converts IR camera data from a neck-mounted device into 3D facial points and head rotation data.
- Machine learning models that detect robot errors from human facial reactions, supporting both cross-participant generalization and single-participant personalization.
- The first systematic comparison of neck-mounted device data against conventional methods such as OpenFace features and frontal video.
Figure 2: Study setup showing the calibration and stimulus rounds. Participants wear NeckFace while watching stimulus videos, with reactions recorded by both the neck-mounted IR cameras and a frontal RGB camera.
Converts IR camera images from the NeckFace device into 3D facial expressions. Requires a short calibration round (~5 minutes) in which participants copy facial movements displayed on an iPhone with a TrueDepth camera.
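As a rough illustration of what such a reconstruction network could look like, the sketch below uses a ResNet-18 backbone to regress 3D facial points and head rotation from a single-channel IR frame. The landmark count, input resolution, and output layout are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical NeckNet-18-style regressor: ResNet-18 backbone mapping a
# single-channel IR image to 3D facial landmarks plus head rotation.
# NUM_LANDMARKS and the output layout are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_LANDMARKS = 52   # assumed number of 3D facial points
ROT_DIMS = 3         # head rotation as pitch / yaw / roll

class NeckNet18(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # IR frames are single-channel, so replace the RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Replace the classification head with a regression head.
        backbone.fc = nn.Linear(backbone.fc.in_features,
                                NUM_LANDMARKS * 3 + ROT_DIMS)
        self.backbone = backbone

    def forward(self, ir_frame):
        out = self.backbone(ir_frame)                      # (B, 52*3 + 3)
        landmarks = out[:, :NUM_LANDMARKS * 3].reshape(-1, NUM_LANDMARKS, 3)
        rotation = out[:, NUM_LANDMARKS * 3:]              # (B, 3)
        return landmarks, rotation

# Example: one 224x224 IR frame.
frame = torch.randn(1, 1, 224, 224)
points, head_rot = NeckNet18()(frame)
```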
Trained on reconstructed facial reactions to detect errors in robot (or human) actions. Supports both cross-participant generalization and single-participant personalization with minimal data.
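The sketch below shows one plausible shape for such a detector: a GRU-FCN-style classifier over windows of reconstructed facial features, in the spirit of the off-the-shelf time-series models evaluated here (GRU_FCN, gMLP, InceptionTime). Layer sizes, window length, and the binary error/no-error output are assumptions for illustration, not the study's exact setup.

```python
# Minimal GRU-FCN-style error classifier over facial-reaction time series
# (flattened 3D landmarks + head rotation per frame). All sizes are assumed.
import torch
import torch.nn as nn

class GRUFCNErrorDetector(nn.Module):
    def __init__(self, n_features: int, n_classes: int = 2):
        super().__init__()
        # Recurrent branch summarizes the temporal evolution of the reaction.
        self.gru = nn.GRU(n_features, 128, batch_first=True)
        # Convolutional branch extracts local temporal patterns.
        self.fcn = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=8, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding="same"),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128 + 128, n_classes)

    def forward(self, x):            # x: (batch, time, features)
        _, h = self.gru(x)           # h: (1, batch, 128)
        conv = self.fcn(x.transpose(1, 2)).squeeze(-1)    # (batch, 128)
        return self.head(torch.cat([h[-1], conv], dim=1))

# Example: a 90-frame reaction window with 52 landmarks (x, y, z) + 3 rotations.
window = torch.randn(4, 90, 52 * 3 + 3)
logits = GRUFCNErrorDetector(n_features=52 * 3 + 3)(window)
```

Representative results from the study are summarized in the table below.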
| Model Type | Dataset | Accuracy | F1-Score |
|---|---|---|---|
| GRU_FCN | NeckData | 65.8% | 63.7% |
| gMLP | OpenData | 60.6% | 53.5% |
| GRU_FCN (5% train) | NeckData | 84.7% | 84.2% |
| InceptionTime (5% train) | OpenData | 78.8% | 78.2% |
NeckData models consistently outperform OpenData (OpenFace-based) models in single-participant scenarios, especially when training data is limited.
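For the "5% train" rows, the sketch below shows one way such a within-participant split could be constructed: a small fraction of each participant's reaction windows trains a personalized detector, and the rest is held out for testing. The sampling scheme and variable names are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a per-participant "5% train" personalization split.
import numpy as np

def five_percent_split(windows, labels, train_frac=0.05, seed=0):
    """Split one participant's reaction windows into train/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(windows))
    n_train = max(1, int(train_frac * len(windows)))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return (windows[train_idx], labels[train_idx],
            windows[test_idx], labels[test_idx])

# Example with dummy data: 200 windows of 90 frames x 159 features each.
X = np.random.randn(200, 90, 159)
y = np.random.randint(0, 2, size=200)
X_tr, y_tr, X_te, y_te = five_percent_split(X, y)
```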
@inproceedings{parreira2025whyface,
title={"Why the face?": Exploring Robot Error Detection
Using Instrumented Bystander Reactions},
author={Parreira, Maria Teresa and Zhang, Ruidong and
Lingaraju, Sukruth Gowdru and Bremers, Alexandra and
Fang, Xuanyu and Ramirez-Aristizabal, Adolfo and
Saha, Manaswi and Kuniavsky, Michael and
Zhang, Cheng and Ju, Wendy},
booktitle={Proceedings of ACM Conference},
year={2025},
organization={ACM}
}
Bystander Affect Detection dataset for HRI failure detection by Bremers et al.
Multimodal error detection challenge at HRI conference by Spitale et al.