MORSE-500 addresses critical limitations in existing multimodal reasoning benchmarks through several key innovations that push beyond static image analysis into dynamic video understanding:
500 newly cooked video clips with CSV metadata – a lightweight package that loads fast and streams efficiently
Videos are generated programmatically so we can dial up complexity and release harder versions as models improve
Six categories – Abstract, Mathematical, Physical, Planning, Spatial, and Temporal (plus Causal) – evenly distributed across the reasoning types that actually matter
Video-based tasks requiring understanding of dynamic sequences, causal chains, and temporal relationships that unfold over time – something static images simply cannot capture
Questions are baked right into the videos. No text crutches, no shortcuts – if you can't see it, you can't solve it
A "-view" subset streams directly on Hugging Face for quick browsing and debugging
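Because the benchmark is generated programmatically, complexity can be controlled with a single knob. A minimal sketch of what that might look like, assuming hypothetical parameters (`n_objects`, `frames`, `causal_chain_length` are illustrative names, not the actual MORSE-500 generator API):

```python
import random

def make_task_spec(difficulty: int, seed: int = 0) -> dict:
    """Hypothetical sketch: scale task complexity with one difficulty knob.

    Higher difficulty means more moving objects to track, longer clips,
    and deeper causal chains. All names here are assumptions for
    illustration, not the real MORSE-500 generator.
    """
    rng = random.Random(seed)
    n_objects = 2 + 2 * difficulty   # more objects to track
    n_frames = 60 + 30 * difficulty  # longer clips
    chain_len = 1 + difficulty       # deeper causal chains
    return {
        "objects": [
            {"id": i, "speed": rng.uniform(0.5, 1.0 + difficulty)}
            for i in range(n_objects)
        ],
        "frames": n_frames,
        "causal_chain_length": chain_len,
    }

easy = make_task_spec(difficulty=1)
hard = make_task_spec(difficulty=4)
print(len(easy["objects"]), len(hard["objects"]))  # 4 10
```

The point of this design is that releasing a harder version is a parameter change, not a re-annotation effort.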
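The CSV metadata is what makes quick filtering and debugging easy. A small sketch of consuming such a file with the standard library, assuming hypothetical column names (`video`, `category`) that may differ from the actual schema:

```python
import csv
import io
from collections import Counter

# Hypothetical metadata rows; the real MORSE-500 CSV schema may differ.
CSV_TEXT = """video,category
clip_0001.mp4,Spatial
clip_0002.mp4,Temporal
clip_0003.mp4,Spatial
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
by_category = Counter(row["category"] for row in rows)

# Slice the benchmark by reasoning category before touching any video files.
spatial_clips = [row["video"] for row in rows if row["category"] == "Spatial"]
print(by_category["Spatial"], spatial_clips)  # 2 ['clip_0001.mp4', 'clip_0003.mp4']
```

Since the questions live inside the videos themselves, the metadata only needs to carry bookkeeping fields, which keeps it small enough to stream alongside the clips.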