Fork me on GitHub

Trending arXiv

Note: this version is tailored to @Smerity - though you can run your own! Trending arXiv may eventually be extended to multiple users ...


MarioQA: Answering Questions by Watching Gameplay Videos

Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han

We present a new benchmark dataset for video question answering (VideoQA) designed to evaluate algorithms' capability of spatio-temporal event understanding. Existing datasets either require very high-level reasoning from multi-modal information to find answers, or is mostly composed of the questions that can be answered by watching a single frame. Therefore, they are not suitable to evaluate models' real capacity and flexibility for VideoQA. To overcome such critical limitations, we focus on event-centric questions that require understanding temporal relation between multiple events in videos. An interesting idea in dataset construction process is that question-answer pairs are automatically generated from Super Mario video gameplays given a set of question templates. We also tackle VideoQA problem in the new dataset, referred to as MarioQA, by proposing spatio-temporal attention models based on deep neural networks. Our experiments show that the proposed deep neural network models with attention have meaningful performance improvement over several baselines.

Captured tweets and retweets: 2