Figure 1: User study scheme. Participants wearing NeckFace watch videos where a scenario of human or robot error is shown, eliciting a reaction. The IR camera image is converted into 3D facial points and head rotation data through a customized NeckNet model. This data is then used to train error detection models which map human reactions to the scenario displayed.
How do humans recognize and rectify social missteps? We achieve social competence by looking around at our peers, decoding subtle cues from bystanders — a raised eyebrow, a laugh — to evaluate the environment and our actions. Robots, however, struggle to perceive and make use of these nuanced reactions.
By employing a novel neck-mounted device that records facial expressions from the chin region, we explore the potential of previously untapped data to capture and interpret human responses to robot error. First, we develop NeckNet-18, a 3D facial reconstruction model that maps the reactions captured by the chin camera onto 3D facial points and head motion. We then use these facial responses to train a robot error detection model that outperforms standard approaches such as OpenFace features or frontal video, and generalizes especially well on within-participant data.
Through this work, we argue for expanding human-in-the-loop robot sensing, fostering more seamless integration of robots into diverse human environments, pushing the boundaries of social cue detection and opening new avenues for adaptable and sustainable robotics.
- A lightweight 3D facial reconstruction model (NeckNet-18, ResNet-18 based) that converts IR camera data from a neck-mounted device into 3D facial points and head rotation data.
- Machine learning models that detect robot errors from human facial reactions, supporting both cross-participant generalization and single-participant personalization.
- The first systematic comparison of neck-mounted device data against conventional methods such as OpenFace features and frontal video.
Figure 2: Study setup showing the calibration and stimulus rounds. Participants wear NeckFace while watching stimulus videos, with reactions recorded by both the neck-mounted IR cameras and a frontal RGB camera.
Converts IR camera images from the NeckFace device into 3D facial expressions. Requires a short calibration round (~5 minutes) in which participants copy facial movements displayed on an iPhone with a TrueDepth camera.
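As a rough illustration of what such a reconstruction network could look like, the sketch below uses a ResNet-18 backbone to regress 3D facial points and head rotation from a single-channel IR frame. The landmark count, input resolution, and output layout are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical NeckNet-18-style regressor: ResNet-18 backbone mapping a
# single-channel IR image to 3D facial landmarks plus head rotation.
# NUM_LANDMARKS and the output layout are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_LANDMARKS = 52   # assumed number of 3D facial points
ROT_DIMS = 3         # head rotation as pitch / yaw / roll

class NeckNet18(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # IR frames are single-channel, so replace the RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Replace the classification head with a regression head.
        backbone.fc = nn.Linear(backbone.fc.in_features,
                                NUM_LANDMARKS * 3 + ROT_DIMS)
        self.backbone = backbone

    def forward(self, ir_frame):
        out = self.backbone(ir_frame)                      # (B, 52*3 + 3)
        landmarks = out[:, :NUM_LANDMARKS * 3].reshape(-1, NUM_LANDMARKS, 3)
        rotation = out[:, NUM_LANDMARKS * 3:]              # (B, 3)
        return landmarks, rotation

# Example: one 224x224 IR frame.
frame = torch.randn(1, 1, 224, 224)
points, head_rot = NeckNet18()(frame)
```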
Trained on reconstructed facial reactions to detect errors in robot (or human) actions. Supports both cross-participant generalization and single-participant personalization with minimal data.
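The sketch below shows one plausible shape for such a detector: a GRU-FCN-style classifier over windows of reconstructed facial features, in the spirit of the off-the-shelf time-series models evaluated here (GRU_FCN, gMLP, InceptionTime). Layer sizes, window length, and the binary error/no-error output are assumptions for illustration, not the study's exact setup.

```python
# Minimal GRU-FCN-style error classifier over facial-reaction time series
# (flattened 3D landmarks + head rotation per frame). All sizes are assumed.
import torch
import torch.nn as nn

class GRUFCNErrorDetector(nn.Module):
    def __init__(self, n_features: int, n_classes: int = 2):
        super().__init__()
        # Recurrent branch summarizes the temporal evolution of the reaction.
        self.gru = nn.GRU(n_features, 128, batch_first=True)
        # Convolutional branch extracts local temporal patterns.
        self.fcn = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=8, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding="same"),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding="same"),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128 + 128, n_classes)

    def forward(self, x):            # x: (batch, time, features)
        _, h = self.gru(x)           # h: (1, batch, 128)
        conv = self.fcn(x.transpose(1, 2)).squeeze(-1)    # (batch, 128)
        return self.head(torch.cat([h[-1], conv], dim=1))

# Example: a 90-frame reaction window with 52 landmarks (x, y, z) + 3 rotations.
window = torch.randn(4, 90, 52 * 3 + 3)
logits = GRUFCNErrorDetector(n_features=52 * 3 + 3)(window)
```

Representative results from the study are summarized in the table below.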
| Model Type | Dataset | Accuracy | F1-Score |
|---|---|---|---|
| GRU_FCN | NeckData | 65.8% | 63.7% |
| gMLP | OpenData | 60.6% | 53.5% |
| GRU_FCN (5% train) | NeckData | 84.7% | 84.2% |
| InceptionTime (5% train) | OpenData | 78.8% | 78.2% |
NeckData models consistently outperform OpenData (OpenFace-based) models in single-participant scenarios, especially when training data is limited.
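For the "5% train" rows, the sketch below shows one way such a within-participant split could be constructed: a small fraction of each participant's reaction windows trains a personalized detector, and the rest is held out for testing. The sampling scheme and variable names are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a per-participant "5% train" personalization split.
import numpy as np

def five_percent_split(windows, labels, train_frac=0.05, seed=0):
    """Split one participant's reaction windows into train/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(windows))
    n_train = max(1, int(train_frac * len(windows)))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return (windows[train_idx], labels[train_idx],
            windows[test_idx], labels[test_idx])

# Example with dummy data: 200 windows of 90 frames x 159 features each.
X = np.random.randn(200, 90, 159)
y = np.random.randint(0, 2, size=200)
X_tr, y_tr, X_te, y_te = five_percent_split(X, y)
```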
@inproceedings{parreira2025whyface,
title={"Why the face?": Exploring Robot Error Detection
Using Instrumented Bystander Reactions},
author={Parreira, Maria Teresa and Zhang, Ruidong and
Lingaraju, Sukruth Gowdru and Bremers, Alexandra and
Fang, Xuanyu and Ramirez-Aristizabal, Adolfo and
Saha, Manaswi and Kuniavsky, Michael and
Zhang, Cheng and Ju, Wendy},
booktitle={Proceedings of ACM Conference},
year={2025},
organization={ACM}
}
Bystander Affect Detection dataset for HRI failure detection by Bremers et al.
Multimodal error detection challenge at HRI conference by Spitale et al.