MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

University of Maryland, College Park

MORSE-500 provides 500 programmatically generated video instances across six reasoning categories: abstract, mathematical, physical, planning, spatial, and temporal.

Example Videos

Key Features

MORSE-500 addresses critical limitations in existing multimodal reasoning benchmarks through several key innovations that push beyond static image analysis into dynamic video understanding:

🚀 Fresh & Portable

500 newly cooked video clips with CSV metadata that run fast and stream efficiently

📈 Scalable Difficulty

Videos are generated programmatically so we can dial up complexity and release harder versions as models improve

🎯 Diverse Categories

Abstract, Mathematical, Physical, Planning, Spatial, Temporal (+ Causal) – broadly balanced across reasoning types that actually matter

⏰ Temporal Complexity

Video-based tasks requiring understanding of dynamic sequences, causal chains, and temporal relationships that unfold over time – something static images simply cannot capture

👁️ Pure Visual Reasoning

Questions are baked right into the videos. No text crutches, no shortcuts – if you can't see it, you can't solve it

🛠️ Developer-Friendly

A "-view" subset streams directly on Hugging Face for quick browsing and debugging

Reasoning Categories

(proportion of data, %)

Abstract (12.8%)
Pattern recognition, logical inference, symbolic reasoning
Mathematical (16.8%)
Arithmetic operations, algebraic relations, quantitative comparisons
Physical (12.8%)
Object dynamics, causal interactions, physics laws
Planning (20.0%)
Multi-step reasoning, goal-directed problem solving
Spatial (21.6%)
Object relationships, spatial transformations, 3D reasoning
Temporal (16.0%)
Sequence understanding, causal inference over time
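
Because the benchmark contains 500 instances, the percentages above correspond to whole-number counts per category. The snippet below is a minimal sketch of that arithmetic, assuming the listed percentages are exact shares of the 500 videos; the names `CATEGORY_SHARE_PCT` and `category_counts` are illustrative and not part of the benchmark's tooling.

```python
# Category shares (%) copied from the list above.
CATEGORY_SHARE_PCT = {
    "Abstract": 12.8,
    "Mathematical": 16.8,
    "Physical": 12.8,
    "Planning": 20.0,
    "Spatial": 21.6,
    "Temporal": 16.0,
}
TOTAL_VIDEOS = 500  # benchmark size stated in the overview


def category_counts(share_pct, total):
    """Convert percentage shares into per-category instance counts."""
    return {name: round(total * pct / 100) for name, pct in share_pct.items()}


counts = category_counts(CATEGORY_SHARE_PCT, TOTAL_VIDEOS)
for name, n in counts.items():
    print(f"{name:<13}{n:>4}")
print(f"{'Total':<13}{sum(counts.values()):>4}")
```

Under this assumption the shares work out to 64, 84, 64, 100, 108, and 80 videos, which sum to 500.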

Leaderboard

| Rank | Model | Model Type | Date | ALL | Abstract | Math | Physical | Planning | Spatial | Temporal |
|---|---|---|---|---|---|---|---|---|---|---|
| - | Human | Human 👤 | 2025-05-28 | 55.4 | 37.5 | 45.5 | 56.3 | 56.0 | 73.1 | 55.2 |
| 1 | o3 🥇 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 23.6 | 23.4 | 27.4 | 28.1 | 5.0 | 29.6 | 31.2 |
| 2 | o4-mini 🥈 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 22.2 | 21.9 | 23.8 | 29.7 | 5.0 | 27.8 | 28.7 |
| 3 | Gemini 2.5 Pro 🥉 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 21.8 | 18.8 | 36.9 | 29.7 | 3.0 | 16.7 | 32.5 |
| 4 | o1 | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 19.8 | 17.2 | 22.6 | 28.1 | 5.0 | 23.1 | 26.2 |
| 5 | Gemini 2.5 Flash | Reasoning VLM 🖼️ 💭 | 2025-05-28 | 19.2 | 9.4 | 35.7 | 28.1 | 1.0 | 24.1 | 18.8 |
| 6 | Gemini 1.5 Pro | VLM 🎬 | 2025-05-28 | 18.8 | 12.5 | 21.4 | 26.6 | 1.0 | 26.9 | 26.2 |
| 7 | Qwen2.5 VL 72B | VLM 🎬 | 2025-05-28 | 17.8 | 6.2 | 21.4 | 34.4 | 1.0 | 22.2 | 25.0 |
| 8 | GPT 4o | Unified Model 🎭 | 2025-05-28 | 17.4 | 17.2 | 20.2 | 34.4 | 4.0 | 12.0 | 25.0 |
| 9 | Qwen2.5 VL 32B AWQ | VLM 🎬 | 2025-05-28 | 16.8 | 14.1 | 23.8 | 34.4 | 1.0 | 15.7 | 18.8 |
| 10 | Qwen2.5 VL 72B AWQ | VLM 🎬 | 2025-05-28 | 16.4 | 12.5 | 11.9 | 29.7 | 2.0 | 27.8 | 16.2 |
| 11 | Gemini 2.0 Flash | VLM 🎬 | 2025-05-28 | 16.0 | 12.5 | 29.8 | 28.1 | 0.0 | 13.0 | 18.8 |
| 12 | Qwen2.5 VL 32B | VLM 🎬 | 2025-05-28 | 15.6 | 9.4 | 19.0 | 29.7 | 2.0 | 16.7 | 21.2 |
| 13 | Gemma 3 27b | VLM 🖼️ | 2025-05-28 | 14.6 | 20.3 | 20.2 | 25.0 | 1.0 | 13.0 | 15.0 |
| 14 | Gemini 2.0 Flash-Lite | VLM 🎬 | 2025-05-28 | 14.2 | 17.2 | 21.4 | 21.9 | 2.0 | 14.8 | 12.5 |
| 15 | MiniCPM-o 2.6 | VLM 🎬 | 2025-05-28 | 11.6 | 4.7 | 10.7 | 23.4 | 1.0 | 16.7 | 15.0 |
| 16 | Qwen2.5 Omni 7B | LMM 🎬🎵 | 2025-05-28 | 11.4 | 6.2 | 9.5 | 21.9 | 2.0 | 15.7 | 15.0 |
| 17 | Qwen2.5 VL 7B | VLM 🎬 | 2025-05-28 | 11.2 | 7.8 | 11.9 | 25.0 | 2.0 | 12.0 | 12.5 |
| 18 | InternVL3 8B | VLM 🖼️ | 2025-05-28 | 7.8 | 6.2 | 6.0 | 14.1 | 1.0 | 11.1 | 10.0 |
| 19 | Qwen2.5 VL 3B | VLM 🎬 | 2025-05-28 | 7.6 | 9.4 | 3.6 | 18.8 | 1.0 | 9.3 | 7.5 |
| 20 | LLaVA-NeXT-Video 7B | VLM 🎬 | 2025-05-28 | 5.0 | 1.6 | 11.9 | 6.2 | 0.0 | 5.6 | 5.0 |

Model Types: 💭 Reasoning • 🖼️ Image • 🎬 Video • 🎵 Audio • 🎭 Unified (Visual Understanding + Generation)
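
The ALL column appears consistent with a weighted average of the per-category scores, using the category proportions from the Reasoning Categories section as weights. The sketch below checks this for two leaderboard rows. It is an observation about the published numbers under that assumption, not a description of the official scoring script, and small differences from rounding are to be expected. The names `WEIGHTS_PCT`, `ROWS`, and `weighted_overall` are illustrative.

```python
# Weights: each category's share of the dataset (%), from Reasoning Categories.
WEIGHTS_PCT = {
    "Abstract": 12.8,
    "Math": 16.8,
    "Physical": 12.8,
    "Planning": 20.0,
    "Spatial": 21.6,
    "Temporal": 16.0,
}

# Two rows copied from the leaderboard above.
ROWS = {
    "o3": {"ALL": 23.6, "Abstract": 23.4, "Math": 27.4, "Physical": 28.1,
           "Planning": 5.0, "Spatial": 29.6, "Temporal": 31.2},
    "Gemini 2.5 Pro": {"ALL": 21.8, "Abstract": 18.8, "Math": 36.9, "Physical": 29.7,
                       "Planning": 3.0, "Spatial": 16.7, "Temporal": 32.5},
}


def weighted_overall(scores, weights_pct):
    """Average the per-category scores, weighting each by its dataset share."""
    total_weight = sum(weights_pct.values())
    return sum(scores[cat] * w for cat, w in weights_pct.items()) / total_weight


for model, scores in ROWS.items():
    estimate = weighted_overall(scores, WEIGHTS_PCT)
    print(f"{model}: reported ALL = {scores['ALL']}, weighted estimate = {estimate:.1f}")
```

For both rows the weighted estimate rounds to the reported ALL value. This also shows why a low Planning score weighs heavily on the total: Planning accounts for a fifth of the benchmark.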

🎯 To submit your results to the leaderboard, please email us your result JSON files.

Difficulty Scaling

One of MORSE-500's key innovations is its ability to systematically scale difficulty through programmatic control. The examples below demonstrate how task complexity can be increased while maintaining the core reasoning category.

Scaling Methodology with Examples

To illustrate our scaling approach, consider the frozen lake environment: we can incrementally increase difficulty by expanding the maze size, adding more action options, introducing fog effects, or reducing the agent's visible range. Similarly, other tasks in our benchmark can be scaled by manipulating sequence length and other relevant parameters.
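
To make the idea of programmatic difficulty control concrete, here is a minimal sketch of a parameterized frozen-lake generator. It is not the generator used to build MORSE-500: the class `LakeDifficulty`, its fields (including `hole_prob`, which the text above does not mention), and all functions are hypothetical. It covers only two of the knobs named above, maze size and visible range, and models fog simply as masking tiles outside the visible range.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LakeDifficulty:
    """Hypothetical difficulty knobs for a frozen-lake maze."""
    grid_size: int = 4                   # side length of the square maze
    hole_prob: float = 0.15              # chance that a tile is a hole (assumed knob)
    visible_range: Optional[int] = None  # None = full view; k = see tiles within k steps


def is_solvable(grid):
    """Breadth-first search from the top-left start to the bottom-right goal."""
    n = len(grid)
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if (r, c) == (n - 1, n - 1):
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < n and 0 <= nc < n
                    and grid[nr][nc] != "H" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def make_lake(cfg, seed):
    """Sample random grids until one has a hole-free path from S to G."""
    rng = random.Random(seed)
    n = cfg.grid_size
    while True:
        grid = [["H" if rng.random() < cfg.hole_prob else "F" for _ in range(n)]
                for _ in range(n)]
        grid[0][0], grid[n - 1][n - 1] = "S", "G"
        if is_solvable(grid):
            return grid


def visible_tiles(grid, agent, cfg):
    """Replace tiles beyond the visible range with '?' to emulate fog."""
    if cfg.visible_range is None:
        return [row[:] for row in grid]
    ar, ac = agent
    return [[tile if max(abs(r - ar), abs(c - ac)) <= cfg.visible_range else "?"
             for c, tile in enumerate(row)]
            for r, row in enumerate(grid)]


# Three illustrative levels: larger maze, more holes, then a limited view.
LEVELS = [
    LakeDifficulty(grid_size=4, hole_prob=0.15),
    LakeDifficulty(grid_size=8, hole_prob=0.20),
    LakeDifficulty(grid_size=12, hole_prob=0.25, visible_range=2),
]

for level, cfg in enumerate(LEVELS, start=1):
    lake = make_lake(cfg, seed=level)
    view = visible_tiles(lake, agent=(0, 0), cfg=cfg)
    print(f"level {level}: {cfg}")
    print("\n".join("".join(row) for row in view))
```

Keeping every knob in one immutable config object and seeding the random generator means a harder variant of the same task can be regenerated reproducibly by changing a single field.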

BibTeX

@article{cai2025morse500,
  title={MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning},
  author={Cai, Zikui and Wang, Andrew and Satheesh, Anirudh and Nakhawa, Ankit and Jae, Hyunwoo and Powell, Keenan and Liu, Minghui and Jay, Neel and Oh, Sungbin and Wang, Xiyao and Liang, Yongyuan and Goldstein, Tom and Huang, Furong},
  journal={arXiv preprint arXiv:xxxx},
  year={2025}
}