We present a new benchmark dataset for video question answering (VideoQA) designed to evaluate algorithms' capability for spatio-temporal event understanding. Existing datasets either require very high-level reasoning over multi-modal information to find answers, or are mostly composed of questions that can be answered by watching a single frame. Therefore, they are not suitable for evaluating models' real capacity and flexibility for VideoQA. To overcome these critical limitations, we focus on event-centric questions that require understanding temporal relations between multiple events in videos. A notable aspect of the dataset construction process is that question-answer pairs are automatically generated from Super Mario gameplay videos given a set of question templates. We also tackle the VideoQA problem on the new dataset, referred to as MarioQA, by proposing spatio-temporal attention models based on deep neural networks. Our experiments show that the proposed attention-based deep neural network models achieve meaningful performance improvements over several baselines.
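As a rough illustration of template-based generation of event-centric questions, the sketch below turns a time-stamped event log into question-answer pairs that depend on the temporal order of two events. It is not the generation pipeline used for MarioQA: the `Event` fields, the single `TEMPLATE` string, the example event names, and the `generate_qa` helper are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One logged gameplay event (hypothetical schema, for illustration only)."""
    frame: int        # frame index at which the event occurs
    description: str  # past-tense phrase used to fill the template
    label: str        # answer class associated with the event


# Hypothetical question template; a real dataset would use many templates.
TEMPLATE = "What did Mario do right after he {reference}?"


def generate_qa(events):
    """Build (question, answer) pairs from consecutive events in one clip.

    Each question refers to one event and asks about the event that follows
    it, so answering requires relating two events in time. Reference events
    whose description occurs more than once in the clip are skipped, since
    the resulting question would be ambiguous.
    """
    ordered = sorted(events, key=lambda e: e.frame)
    counts = Counter(e.description for e in ordered)
    pairs = []
    for reference, following in zip(ordered, ordered[1:]):
        if counts[reference.description] > 1:
            continue  # ambiguous reference event
        question = TEMPLATE.format(reference=reference.description)
        pairs.append((question, following.label))
    return pairs


if __name__ == "__main__":
    clip = [
        Event(30, "ate a mushroom", "eat"),
        Event(10, "killed a Goomba", "kill"),
        Event(55, "jumped", "jump"),
    ]
    for question, answer in generate_qa(clip):
        print(question, "->", answer)
```

Neither question produced for this clip can be answered from a single frame, because the answer depends on which event follows the referenced one.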