Movies store a large amount of multimodal knowledge. They are a rich source of common-sense knowledge about the world, about actions and their effects, about people’s behaviors and emotions, and about stories. Movies provide rich visual content spanning long periods of time, telling a full story with rich interactions, emotions, and events. Our goal is to produce a system (algorithm + database) that takes an image or video as input, describes the semantic content of the scene and what the people in it are doing, and understands the situation depicted. To achieve this goal, we propose to build a knowledge database covering a wide variety of situations and to train a system to parse scenes and provide complex descriptions of their content.
Network dissection and understanding the internal representation learned by dynamic neural networks. Network dissection is a family of methods for characterizing the internal representation that a neural network learns while trying to solve a task. Characterizing this representation opens the door to new approaches for unsupervised object discovery and for unsupervised learning of common-sense knowledge.
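The text does not say which dissection method is meant, so the following is only a minimal sketch of one commonly used formulation: a unit's activation maps are thresholded at a high quantile and compared against pixel-level concept masks using intersection over union (IoU). The function names, the array layout, and the default quantile of 0.995 are assumptions made for illustration, not details taken from this project.

```python
import numpy as np


def unit_concept_iou(activations, concept_masks, quantile=0.995):
    """Score how well one unit matches one concept.

    activations:   float array (N, H, W), the unit's activation maps over
                   N images, assumed already resized to the mask resolution.
    concept_masks: bool array (N, H, W), True where the concept is present.
    quantile:      assumed threshold level; only the strongest activations
                   count as the unit "firing".
    """
    # One threshold per unit, computed over all images and positions.
    threshold = np.quantile(activations, quantile)
    unit_mask = activations > threshold

    intersection = np.logical_and(unit_mask, concept_masks).sum()
    union = np.logical_or(unit_mask, concept_masks).sum()
    return float(intersection) / float(union) if union > 0 else 0.0


def best_concept(activations, masks_by_concept, quantile=0.995):
    """Return (concept_name, iou) for the best-matching concept.

    masks_by_concept is a hypothetical dict mapping a concept name to its
    (N, H, W) boolean mask array.
    """
    scores = {
        name: unit_concept_iou(activations, masks, quantile)
        for name, masks in masks_by_concept.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]
```

Under these assumptions, a unit whose best IoU exceeds some chosen cutoff could be labeled as a detector for that concept, which is one way the learned representation can be related to object discovery.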
Understanding movies. Understanding a movie requires analyzing the video at different time scales and reasoning about different types of events. Where people in a video are looking is an important signal for understanding them and their actions. In this project, we present an approach for following gaze in video: it predicts where a person in the video is looking, even when the object being looked at appears in a different frame. This system can then be applied to a variety of tasks, including movie understanding, activity recognition, and social interaction prediction.