Bad Idea or Good Prediction? Comparing VLM and Human Anticipatory Judgment

Anonymous Author(s)
HRI 2026
Research Questions Overview


Can VLMs predict whether situations will end well or poorly? We evaluate both direct scenario analysis (RQ1) and the models' ability to interpret the reactions of humans watching those scenarios (RQ2).

Abstract

Anticipatory reasoning – predicting whether situations will resolve positively or negatively by interpreting contextual cues – is crucial for robots operating in human environments. This exploratory study evaluates whether Vision Language Models (VLMs) possess such predictive capabilities through two complementary approaches.

First, we test VLMs on direct outcome prediction: we show them videos of human and robot scenarios with the outcomes removed and ask them to predict whether each situation will end well or poorly. Second, we introduce a novel evaluation of anticipatory social intelligence: can VLMs predict outcomes by analyzing the facial reactions of people watching these scenarios?

We tested multiple closed-source and open-source VLMs with various prompts and compared their predictions against both the true outcomes and the judgments of 29 human participants. The best-performing VLM (Gemini 2.0 Flash) achieved 70.0% accuracy in predicting true outcomes, outperforming the average individual human (62.1% ± 6.2%). Agreement with individual human judgments ranged from 44.4% to 69.7%.

Critically, VLMs struggled to predict outcomes from human facial reactions, suggesting limitations in leveraging social cues. These preliminary findings indicate that while some VLMs show promise for anticipatory reasoning, their performance is sensitive to model selection and prompt design, and their limited social intelligence warrants further investigation before deployment in human-robot interaction applications.

Research Questions

We investigate anticipatory reasoning through two complementary approaches:

  • RQ1: What are the anticipatory reasoning capabilities of VLMs (i.e., predicting outcomes on the scenario dataset)?
  • RQ2: What is the anticipatory social intelligence of VLMs (i.e., predicting anticipated outcomes based on human facial reactions)?

Dataset

[Figure: example stimuli, showing a hoverboard scenario and a robot scenario]

We used 30 videos from the "Bad Idea?" study, each showing an everyday scenario involving a human or a robot and cut off before the outcome is revealed. The set includes scenarios with both good and bad outcomes. Baseline human judgments were collected from 29 participants in an online study.
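
The data format is not described in this summary; purely as an illustration, each stimulus and its annotations could be held in a record like the one below (all field names and values are assumptions, not taken from the study).

from dataclasses import dataclass
from typing import List

@dataclass
class Stimulus:
    """Hypothetical schema for one outcome-truncated clip from the 'Bad Idea?' set."""
    video_id: str               # e.g. "hoverboard_01" (illustrative identifier)
    agent: str                  # "human" or "robot" scenario
    true_outcome: str           # "good" or "bad"; withheld from models and raters
    human_judgments: List[str]  # one "good"/"bad" prediction per participant (n = 29)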

Models Tested

We evaluated both closed-source and open-source VLMs (a minimal query sketch follows these lists):

Closed-Source VLMs (API-accessed):

  • GPT-4o
  • Gemini 2.0 Flash
  • Qwen2.5-VL (72B)

Open-Source VLMs (locally deployed via Ollama):

  • DeepSeek-OCR (3B)
  • Gemma 3 (4B)
  • LLaVA-LLaMA 3 (8B)
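
The evaluation harness is not published with this summary; the sketch below shows one plausible way to query a locally deployed model through the Ollama Python client with sampled video frames and a binary prompt. The prompt text, model tag, and helper name are illustrative assumptions, not the study's Prompt A or Prompt B.

import ollama  # assumes the official Ollama Python client and a running local server

# Illustrative prompt only; not the study's Prompt A or Prompt B
PROMPT = (
    "You will see frames from a video that stops before the outcome is shown. "
    "Will this situation end well or badly? Answer with one word: GOOD or BAD."
)

def predict_outcome(model, frame_paths):
    """Ask a locally served VLM for a binary outcome prediction on sampled frames."""
    response = ollama.chat(
        model=model,  # e.g. "llava-llama3:8b"; the exact tag is an assumption
        messages=[{"role": "user", "content": PROMPT, "images": frame_paths}],
    )
    answer = response["message"]["content"].strip().upper()
    return "good" if "GOOD" in answer else "bad"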

Key Results

RQ1: Anticipatory Reasoning of VLMs

Ground Truth Performance:

  • The best-performing closed-source VLM (Gemini 2.0 Flash with Prompt A) achieved 70.0% accuracy, exceeding average human performance (62.1% ± 6.2%)
  • Performance varied substantially across models (43.3% to 70.0%)
  • Open-source models demonstrated competitive but generally lower performance, with the best configuration (LLaVA-LLaMA 3 with Prompt B) reaching 63.3% accuracy
  • Some models exhibited severe prediction bias (e.g., DeepSeek-OCR and Gemma 3 predicted only one type of outcome)
  • Prompt engineering impacted performance even within the same model, with variations up to 6.7 percentage points

Agreement with Individual Human Predictions:

  • Gemini 2.0 Flash achieved 69.7% ± 7.0% agreement with individual human judgments
  • Models showing higher agreement with individual human predictions did not necessarily achieve higher accuracy on true outcomes, suggesting that models and humans share some of the same misjudgments (a sketch of both metrics follows this list)
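
For reference, a minimal sketch of the two measures under standard definitions (the exact formulas are not spelled out in this summary): accuracy is the share of the 30 clips on which a predictor matches the true outcome, and agreement is the share of clips on which the model and one participant give the same label, reported as mean ± standard deviation over the 29 participants. The data layout is a hypothetical mapping from video identifier to label.

import statistics

def accuracy(preds, truth):
    """Fraction of clips on which predictions match the true outcomes.
    Both arguments map video_id -> "good"/"bad" (hypothetical layout)."""
    return sum(preds[v] == truth[v] for v in truth) / len(truth)

def agreement_with_humans(model_preds, human_preds):
    """Mean and standard deviation of per-participant agreement with the model.
    `human_preds` is a list of per-participant dicts (video_id -> "good"/"bad")."""
    per_participant = [
        sum(model_preds[v] == h[v] for v in h) / len(h) for h in human_preds
    ]
    return statistics.mean(per_participant), statistics.stdev(per_participant)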

RQ2: Anticipatory Social Intelligence

  • VLMs performed poorly at predicting outcomes from human anticipatory reactions, with accuracy ranging from 44.5% to 53.8%
  • Several models exhibited severe prediction bias, suggesting difficulty in discriminating between different outcome types
  • No significant performance difference emerged between 1-second and 3-second temporal windows
  • These results hint at limitations in current VLMs' ability to interpret human facial expressions and anticipatory reactions as predictive signals (a window-extraction sketch follows this list)
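
As an illustration of the 1-second and 3-second conditions, the sketch below cuts a reaction window that ends at the moment the outcome would be revealed, using ffmpeg via subprocess. The file names, reveal timestamps, and function name are made-up examples, not details from the study.

import subprocess

def cut_reaction_window(src, reveal_s, window_s, dst):
    """Extract the `window_s` seconds of a viewer-reaction video that precede
    the outcome reveal at `reveal_s` seconds (all values illustrative)."""
    start = max(0.0, reveal_s - window_s)
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", src, "-t", f"{window_s:.2f}", dst],
        check=True,
    )

# e.g. 1-second and 3-second windows before a hypothetical reveal at t = 12.0 s:
# cut_reaction_window("viewer_03.mp4", reveal_s=12.0, window_s=1.0, dst="viewer_03_1s.mp4")
# cut_reaction_window("viewer_03.mp4", reveal_s=12.0, window_s=3.0, dst="viewer_03_3s.mp4")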

Main Takeaways

  1. Some VLMs can exceed human performance: Gemini 2.0 Flash achieved 70.0% accuracy on scenario prediction, exceeding average human performance (62.1%). However, considerable fragility emerged: performance varied by up to 26.7 percentage points across models and by up to 6.7 percentage points across prompts within some models.
  2. Performance gap between closed-source and open-source models: The best closed-source model (70.0%) outperformed the best open-source model (63.3%), which is unsurprising given differences in scale, training resources, and fine-tuning.
  3. Limitations in social intelligence: VLMs analyzing human facial reactions achieved only 47.9% agreement with human predictions, substantially below direct scenario prediction performance (63-70%). This suggests a critical gap in VLMs' ability to interpret human anticipatory states and emotional signals – capabilities essential for effective HRI.
  4. Prompt engineering matters: Variations of up to 6.7 percentage points within the same model highlight the importance of careful prompt design for anticipatory reasoning tasks.

Implications for Human-Robot Interaction

  • VLMs show potential for proactive error prevention in robots, though performance is highly model-dependent
  • Current limitations in social intelligence may significantly constrain applications requiring interpretation of human emotional states and anticipatory behaviors
  • Careful model selection and prompt engineering are critical for safety-sensitive applications

BibTeX

@inproceedings{badidea2026,
  title={Bad Idea or Good Prediction? Comparing VLM and Human Anticipatory Judgment},
  author={Anonymous Author(s)},
  booktitle={Proceedings of ACM/IEEE International Conference on Human-Robot Interaction (HRI)},
  year={2026}
}