Anticipatory reasoning – predicting whether situations will resolve positively or negatively by interpreting contextual cues – is crucial for robots operating in human environments. This exploratory study evaluates whether Vision Language Models (VLMs) possess such predictive capabilities through two complementary approaches.
First, we test VLMs on direct outcome prediction by presenting videos of human and robot scenarios with the outcomes removed and asking the models to predict whether the situations will end well or poorly. Second, we introduce a novel evaluation of anticipatory social intelligence: can VLMs predict outcomes by analyzing the facial reactions of people watching these scenarios?
We tested multiple VLMs, both closed-source and open-source, using various prompts and compared their predictions against both the true outcomes and judgments from 29 human participants. The best-performing VLM (Gemini 2.0 Flash) achieved 70.0% accuracy in predicting true outcomes, outperforming the average individual human (62.1% ± 6.2%). VLM agreement with individual human judgments ranged from 44.4% to 69.7%.
Critically, VLMs struggled to predict outcomes by analyzing human facial reactions, suggesting limitations in leveraging social cues. These preliminary findings indicate that while some VLMs show promise for anticipatory reasoning, their performance is sensitive to model selection and prompt design, with current limitations in social intelligence that warrant further investigation for human-robot interaction applications.
We investigate anticipatory reasoning through two complementary approaches: (1) direct outcome prediction, where VLMs see videos of human and robot scenarios with the outcomes removed and predict whether each situation will end well or poorly, and (2) prediction from social cues, where VLMs see only the facial reactions of people watching those scenarios. A minimal prompt sketch follows the example scenarios below.
Example: Hoverboard scenario
Example: Robot scenario
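To make the two settings concrete, the sketch below gives hypothetical prompt templates and a minimal parser that maps a model's free-text reply to a binary good/bad label. The wording is illustrative only and does not reproduce the exact prompts used in the study.

# Hypothetical prompt templates for the two settings; the study's exact wording may differ.
DIRECT_PROMPT = (
    "You will see a video of a situation with the ending removed. "
    "Will this situation end well or badly? Answer with one word: 'good' or 'bad'."
)
REACTION_PROMPT = (
    "You will see a video of a person's face while they watch a scenario unfold. "
    "Based only on their facial reaction, will the scenario end well or badly? "
    "Answer with one word: 'good' or 'bad'."
)

def parse_prediction(reply: str) -> str | None:
    """Map a free-text VLM reply to a binary outcome label, if one can be extracted."""
    text = reply.lower()
    if "good" in text and "bad" not in text:
        return "good"
    if "bad" in text and "good" not in text:
        return "bad"
    return None  # ambiguous reply; would need re-prompting or manual review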
We used 30 videos from the "Bad Idea?" study, featuring scenarios in which humans and robots are shown before the outcome is revealed. The videos cover everyday situations with both good and bad outcomes. Twenty-nine participants from an online study provided the baseline human judgments.
We evaluated both closed-source and open-source VLMs:
Closed-Source VLMs (API-accessed):
Open-Source VLMs (locally deployed via Ollama):
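As a minimal sketch of how a locally deployed model could be queried, the snippet below samples frames from an outcome-removed clip and sends them to a VLM through the Ollama Python client. The model name ("llava"), the frame-sampling scheme, and the prompt argument are illustrative assumptions, not the study's exact pipeline.

import cv2     # pip install opencv-python
import ollama  # pip install ollama; assumes a local Ollama server is running

def sample_frames(video_path: str, num_frames: int = 8) -> list[bytes]:
    """Uniformly sample frames from the outcome-removed clip as JPEG bytes."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.imencode(".jpg", frame)[1].tobytes())
    cap.release()
    return frames

def predict_outcome(video_path: str, prompt: str, model: str = "llava") -> str:
    """Ask a locally served VLM whether the clipped scenario will end well or badly."""
    response = ollama.chat(
        model=model,  # hypothetical choice; any locally pulled VLM could be used
        messages=[{"role": "user", "content": prompt, "images": sample_frames(video_path)}],
    )
    return response["message"]["content"]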
Ground Truth Performance: The best-performing VLM (Gemini 2.0 Flash) reached 70.0% accuracy against the true outcomes, compared with 62.1% ± 6.2% for the average individual human participant.
Agreement with Individual Human Predictions: VLM agreement with individual participants' judgments ranged from 44.4% to 69.7%.
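The sketch below illustrates how these two metrics (accuracy against true outcomes and agreement with each individual participant) can be computed from per-video predictions. The records shown are placeholders, not the study's data.

# Hypothetical per-video records: VLM prediction, true outcome, and each
# participant's judgment ("good"/"bad"). Values below are placeholders.
videos = [
    {"vlm": "bad",  "truth": "bad", "humans": ["bad", "good", "bad"]},
    {"vlm": "good", "truth": "bad", "humans": ["bad", "bad", "good"]},
    # ... one record per video (30 in the study)
]

# Ground-truth accuracy: fraction of videos where the VLM matches the true outcome.
accuracy = sum(v["vlm"] == v["truth"] for v in videos) / len(videos)

# Agreement with each individual human: fraction of videos where the VLM's
# prediction matches that participant's judgment; reported as a min-max range.
num_humans = len(videos[0]["humans"])
agreements = [
    sum(v["vlm"] == v["humans"][h] for v in videos) / len(videos)
    for h in range(num_humans)
]
print(f"accuracy={accuracy:.1%}, agreement range={min(agreements):.1%}-{max(agreements):.1%}")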
@inproceedings{badidea2026,
title={Bad Idea or Good Prediction? Comparing VLM and Human Anticipatory Judgment},
author={Anonymous Author(s)},
booktitle={Proceedings of ACM/IEEE International Conference on Human-Robot Interaction (HRI)},
year={2026}
}