MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning


¹University of Maryland, College Park ²Capital One

Key Features

MORSE-500 addresses critical limitations in existing multimodal reasoning benchmarks through several key innovations that push beyond static image analysis into dynamic video understanding:

🚀 Fresh & Portable

500 freshly created video clips with CSV metadata that load fast and stream efficiently

📈 Scalable Difficulty

Videos are generated programmatically so we can dial up complexity and release harder versions as models improve

🎯 Diverse Categories

Spanning Abstract, Mathematical, Physical, Planning, Spatial, Temporal (+ Causal) – a vibrant mix of the reasoning types that matter

⏰ Temporal Complexity

Video-based tasks requiring understanding of dynamic sequences, causal chains, and temporal relationships that unfold over time – something static images simply cannot capture

👁️ Pure Visual Reasoning

Questions are baked right into the videos. No text crutches, no shortcuts – if you can't see it, you can't solve it

🛠️ Developer-Friendly

A "-view" subset streams directly on Hugging Face, making browsing and debugging smoother than a sunny afternoon

Reasoning Categories

(proportion of data, %)

Abstract (12.8%)
Pattern recognition, logical inference, symbolic reasoning

Mathematical (16.8%)
Arithmetic operations, algebraic relations, quantitative comparisons

Physical (12.8%)
Object dynamics, causal interactions, physics laws

Planning (20.0%)
Multi-step reasoning, goal-directed problem solving

Spatial (21.6%)
Object relationships, spatial transformations, 3D reasoning

Temporal (16.0%)
Sequence understanding, causal inference over time
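
As a quick sanity check on the proportions above, the short sketch below converts each percentage into a video count. It assumes the listed percentages are shares of the full 500-clip set; the names `TOTAL_VIDEOS`, `CATEGORY_SHARE`, and `category_counts` are illustrative and not part of the benchmark's tooling.

```python
# Category shares as listed above (percent of the 500 videos).
# Assumption: each percentage refers to the full 500-clip benchmark.
TOTAL_VIDEOS = 500
CATEGORY_SHARE = {
    "Abstract": 12.8,
    "Mathematical": 16.8,
    "Physical": 12.8,
    "Planning": 20.0,
    "Spatial": 21.6,
    "Temporal": 16.0,
}


def category_counts(total, shares):
    """Convert percentage shares into whole video counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = category_counts(TOTAL_VIDEOS, CATEGORY_SHARE)
for name, n in counts.items():
    print(f"{name:<13}{n:>4}")
print(f"{'Total':<13}{sum(counts.values()):>4}")
```

Under that assumption the shares correspond to 64, 84, 64, 100, 108, and 80 videos, which sum to 500.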

Leaderboard

Rank	Model	Model Type	Date	ALL	Abstract	Math	Physical	Planning	Spatial	Temporal
-	Human	Human 👤	2025-05-28	55.4	37.5	45.5	56.3	56.0	73.1	55.2
1	o3 🥇	Reasoning VLM 🖼️ 💭	2025-05-28	23.6	23.4	27.4	28.1	5.0	29.6	31.2
2	o4-mini 🥈	Reasoning VLM 🖼️ 💭	2025-05-28	22.2	21.9	23.8	29.7	5.0	27.8	28.7
3	Gemini 2.5 Pro 🥉	Reasoning VLM 🖼️ 💭	2025-05-28	21.8	18.8	36.9	29.7	3.0	16.7	32.5
4	o1	Reasoning VLM 🖼️ 💭	2025-05-28	19.8	17.2	22.6	28.1	5.0	23.1	26.2
5	Gemini 2.5 Flash	Reasoning VLM 🖼️ 💭	2025-05-28	19.2	9.4	35.7	28.1	1.0	24.1	18.8
6	Gemini 1.5 Pro	VLM 🎬	2025-05-28	18.8	12.5	21.4	26.6	1.0	26.9	26.2
7	Qwen2.5 VL 72B	VLM 🎬	2025-05-28	17.8	6.2	21.4	34.4	1.0	22.2	25.0
8	GPT 4o	Unified Model 🎭	2025-05-28	17.4	17.2	20.2	34.4	4.0	12.0	25.0
9	Qwen2.5 VL 32B AWQ	VLM 🎬	2025-05-28	16.8	14.1	23.8	34.4	1.0	15.7	18.8
10	Qwen2.5 VL 72B AWQ	VLM 🎬	2025-05-28	16.4	12.5	11.9	29.7	2.0	27.8	16.2
11	Gemini 2.0 Flash	VLM 🎬	2025-05-28	16.0	12.5	29.8	28.1	0.0	13.0	18.8
12	Qwen2.5 VL 32B	VLM 🎬	2025-05-28	15.6	9.4	19.0	29.7	2.0	16.7	21.2
13	Gemma 3 27b	VLM 🖼️	2025-05-28	14.6	20.3	20.2	25.0	1.0	13.0	15.0
14	Gemini 2.0 Flash-Lite	VLM 🎬	2025-05-28	14.2	17.2	21.4	21.9	2.0	14.8	12.5
15	MiniCPM-o 2.6	VLM 🎬	2025-05-28	11.6	4.7	10.7	23.4	1.0	16.7	15.0
16	Qwen2.5 Omni 7B	LMM 🎬🎵	2025-05-28	11.4	6.2	9.5	21.9	2.0	15.7	15.0
17	Qwen2.5 VL 7B	VLM 🎬	2025-05-28	11.2	7.8	11.9	25.0	2.0	12.0	12.5
18	InternVL3 8B	VLM 🖼️	2025-05-28	7.8	6.2	6.0	14.1	1.0	11.1	10.0
19	Qwen2.5 VL 3B	VLM 🎬	2025-05-28	7.6	9.4	3.6	18.8	1.0	9.3	7.5
20	LLaVA-NeXT-Video 7B	VLM 🎬	2025-05-28	5.0	1.6	11.9	6.2	0.0	5.6	5.0

Model Types: 💭 Reasoning • 🖼️ Image • 🎬 Video • 🎵 Audio • 🎭 Unified (Visual Understanding + Generation)
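
The page does not state how the ALL column is computed. One plausible reading, sketched below as an assumption rather than the authors' documented method, is that ALL is the per-question accuracy over the whole benchmark, which equals the category scores weighted by each category's share of the data. The example uses the o3 row and the category proportions listed earlier; `weighted_overall` is an illustrative helper.

```python
# Category shares (percent of data) and the o3 row from the leaderboard.
SHARE = {
    "Abstract": 12.8, "Math": 16.8, "Physical": 12.8,
    "Planning": 20.0, "Spatial": 21.6, "Temporal": 16.0,
}
O3 = {
    "Abstract": 23.4, "Math": 27.4, "Physical": 28.1,
    "Planning": 5.0, "Spatial": 29.6, "Temporal": 31.2,
}


def weighted_overall(scores, shares):
    """Average the per-category scores, weighted by category share.

    Assumption: ALL is a per-question (micro) average, not a plain mean
    of the six category columns.
    """
    total_share = sum(shares.values())
    return sum(scores[c] * shares[c] for c in shares) / total_share


overall = weighted_overall(O3, SHARE)
print(f"{overall:.1f}")  # 23.6, matching the ALL column for o3
```

Because the per-category values in the table are already rounded, this reconstruction can only be expected to agree with the reported ALL score up to rounding.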

🎯 To submit your results to the leaderboard, please complete this form.

Difficulty Scaling

One of MORSE-500's key innovations is its ability to systematically scale difficulty through programmatic control. The examples below demonstrate how task complexity can be increased while maintaining the core reasoning category.

Scaling Methodology with Examples

To illustrate our scaling approach, consider the frozen lake environment: we can incrementally increase difficulty by expanding the maze size, adding more action options, introducing fog effects, or reducing the agent's visible range. Similarly, other tasks in our benchmark can be scaled by manipulating sequence length and other relevant parameters.
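
The idea of turning difficulty into a handful of parameters can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the class `LakeDifficulty`, the functions `difficulty_for_level`, `is_solvable`, and `make_lake`, and every numeric setting are hypothetical. The sketch only shows how maze size, number of action options, fog, and visible range could be dialed up together while each sampled maze remains solvable.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LakeDifficulty:
    """Hypothetical difficulty knobs for a frozen-lake style task."""
    grid_size: int               # side length of the square maze
    num_actions: int             # action options offered to the agent
    fog: bool                    # whether a fog effect is rendered
    view_radius: Optional[int]   # cells visible around the agent; None = full map


def difficulty_for_level(level: int) -> LakeDifficulty:
    """Map an integer level (0 = easiest) to harder settings on every knob."""
    return LakeDifficulty(
        grid_size=4 + 2 * level,
        num_actions=4 if level < 2 else 8,
        fog=level >= 1,
        view_radius=None if level == 0 else max(1, 4 - level),
    )


def is_solvable(grid: List[str]) -> bool:
    """Breadth-first search (4-neighbour moves) from top-left to bottom-right."""
    n = len(grid)
    seen, queue = {(0, 0)}, deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if (r, c) == (n - 1, n - 1):
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < n and 0 <= nc < n
                    and grid[nr][nc] != "H" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def make_lake(cfg: LakeDifficulty, seed: int, hole_prob: float = 0.2) -> List[str]:
    """Sample a grid of frozen (F) and hole (H) cells with start S and goal G."""
    rng = random.Random(seed)
    n = cfg.grid_size
    for _ in range(1000):
        rows = [["H" if rng.random() < hole_prob else "F" for _ in range(n)]
                for _ in range(n)]
        rows[0][0], rows[-1][-1] = "S", "G"
        grid = ["".join(row) for row in rows]
        if is_solvable(grid):
            return grid
    # Fallback: a hole-free lake is always solvable.
    return ["S" + "F" * (n - 1)] + ["F" * n] * (n - 2) + ["F" * (n - 1) + "G"]


for level in range(3):
    cfg = difficulty_for_level(level)
    lake = make_lake(cfg, seed=0)
    print(level, cfg, f"{len(lake)}x{len(lake[0])} maze")
```

Keeping all knobs in one immutable config makes it straightforward to regenerate a harder variant of the same task by changing a single level value and a seed.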

BibTeX

@article{cai2025morse500,
  title={MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning},
  author={Cai, Zikui and Wang, Andrew and Satheesh, Anirudh and Nakhawa, Ankit and Jae, Hyunwoo and Powell, Keenan and Liu, Minghui and Jay, Neel and Oh, Sungbin and Wang, Xiyao and Liang, Yongyuan and Goldstein, Tom and Huang, Furong},
  journal={arXiv preprint arXiv:2506.05523},
  year={2025}
}
