RESEARCH PROGRAMS & PROJECTS
Program 1 - Visual Intelligence
RP1-2 & RP1-4
Building Cognitive-Inspired Visual Intelligence Models & Towards Intelligent Agents with Generalizable Knowledge
The Embodied AI direction led by Prof. Wang Liwei at CPII focuses on natural language processing and computer vision research. His research aims to create intelligent machines that can understand the surrounding visual world, communicate in natural language, and interact with the environment. On the perception level, intelligent machines can recognize visual content and describe it in natural language. On the interaction level, intelligent machines should understand the visual scene and complete tasks by navigating and performing actions. More importantly, to consider and plan for the long-term consequences of their actions, the group is also developing algorithms that reason over multi-modal inputs. Beyond the directions mentioned above, the group is also devoted to solving some of the most challenging problems in real-world scenarios, such as understanding large AI models and adapting them to many demanding tasks.
Interpretable models for movie understanding
Movies store a large amount of multimodal knowledge. They are a rich source of common-sense knowledge about the world: about actions and their effects, about people’s behaviors and emotions, and about stories. Movies provide rich visual content covering long periods of time, telling a full story with rich interactions, emotions, and events. Our goal is to produce a system (algorithm + database) that can take an image or video as input, produce a description of the semantic content of the scene and what the people are doing, and understand the situation depicted in the scene. To achieve this goal, we propose to build a knowledge database covering a wide and varied range of situations and to train a system to parse scenes and provide complex descriptions of their content.
Aim 1: Network dissection and understanding the internal representation learned by dynamic neural networks. Network dissection comprises a family of methods for characterizing the internal representation learned by a neural network when solving a task. Characterizing this internal representation opens the door to new approaches for unsupervised object discovery and for unsupervised learning of common-sense knowledge.
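As a minimal sketch of the network-dissection idea (not the group's actual implementation), the snippet below scores a single convolutional unit against candidate visual concepts: the unit's activation maps are thresholded at a high quantile, and the resulting binary masks are compared with concept segmentation masks by intersection-over-union (IoU). The function name, the quantile value, and the data layout are illustrative assumptions.

```python
import numpy as np

def dissect_unit(activations, concept_masks, quantile=0.995):
    """Score one conv unit against candidate concepts via IoU.

    activations:   (N, H, W) activation maps of the unit over N images.
    concept_masks: dict mapping concept name -> (N, H, W) boolean masks.
    Returns (best_concept_name, best_iou).
    """
    # Threshold at a high activation quantile, the standard move
    # in network-dissection-style analyses.
    t = np.quantile(activations, quantile)
    unit_mask = activations > t
    best, best_iou = None, 0.0
    for name, mask in concept_masks.items():
        inter = np.logical_and(unit_mask, mask).sum()
        union = np.logical_or(unit_mask, mask).sum()
        iou = inter / union if union else 0.0
        if iou > best_iou:
            best, best_iou = name, iou
    return best, best_iou
```

A unit whose thresholded activations consistently overlap, say, "sky" masks would then be interpreted as a sky detector, even though no such label was used during training.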
Aim 2: Understanding movies. Understanding a movie requires analyzing the video at different time scales and reasoning about different types of events. Following the gaze of people inside videos is an important signal for understanding people and their actions. In this project, we present an approach for following gaze in video by predicting where a person in the video is looking, even when the target object appears in a different frame. This system can then be deployed to solve a variety of tasks: movie understanding, activity recognition, and social interaction prediction.
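A hedged sketch of the gaze-following step described above: assuming a head position, an estimated gaze direction, and candidate target locations (possibly detected in other frames but mapped into a shared image coordinate system) are already available, candidates can be ranked by how well they align with the gaze ray. The function name and interface are hypothetical, not the project's actual model.

```python
import numpy as np

def score_gaze_targets(head_xy, gaze_dir, candidates):
    """Rank candidate target locations by alignment with a gaze direction.

    head_xy:    (2,) head position in image coordinates.
    gaze_dir:   (2,) unit vector of the estimated gaze direction.
    candidates: (K, 2) candidate target positions.
    Returns candidate indices sorted best- to worst-aligned.
    """
    vecs = candidates - head_xy
    # Normalize head-to-candidate vectors (guarding against zero length),
    # then score each by cosine similarity with the gaze direction.
    norms = np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-8)
    cos = (vecs / norms) @ gaze_dir
    return np.argsort(-cos)
```

In the full system this geometric score would be only one cue; a learned model can additionally weigh saliency and temporal context when the target lies in a different frame.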
Universal Representations for exchanging and integrating visual knowledge
Recent developments in deep learning have had great impact across various application domains. Model-agnostic approaches have clear advantages: domain knowledge is no longer necessary for data processing, yet good performance can often be achieved thanks to a holistic view of the data. However, plenty of side information and domain knowledge can hardly be utilized in current DNN solutions. The relatively low data efficiency makes the sample complexity and computation complexity quickly grow to, and beyond, the capacity of supercomputers.
We envision a next generation of data processing infrastructure that fundamentally solves these problems by focusing on processing knowledge instead of raw data. Knowledge represented in a compact form should typically have orders of magnitude lower dimensionality than the raw data, and can thus be stored, exchanged, and managed efficiently. We believe that understanding, exchanging, and integrating such knowledge is the key step towards building the scalable, multi-purpose, and secure data infrastructure of the next generation.
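To make the "orders of magnitude lower dimensionality" point concrete, the sketch below uses plain PCA (via SVD) as a stand-in for a learned compact representation: raw data is projected onto a small shared basis, and the resulting codes can be stored, exchanged, and decoded by any party holding the basis. This is an illustrative analogy of our own, not the proposed knowledge representation itself.

```python
import numpy as np

def compress(data, d):
    """Project (N, D) raw data onto its top-d principal directions,
    yielding compact (N, d) codes plus a shared basis and mean."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:d]              # (d, D) shared "codebook"
    codes = centered @ basis.T  # (N, d) compact codes
    return codes, basis, mean

def decompress(codes, basis, mean):
    """Reconstruct approximate raw data from compact codes."""
    return codes @ basis + mean
```

When the data truly lies near a d-dimensional subspace, the codes carry essentially all of its information at a fraction of the storage and communication cost, which is the property the envisioned knowledge representations aim for at far larger scale.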
We aim at developing theories and algorithms for a unified knowledge representation that 1) allows knowledge learned from one data source to be used for other inference tasks and facilitates knowledge exchange between multiple tasks, 2) allows interpretation of the knowledge learned from data, 3) can perform multi-domain learning by jointly processing multiple data sources, 4) can incorporate domain knowledge, and 5) tolerates noise and quantization errors in storage and communication, thus allowing scalable system implementations.
Towards learning this unified knowledge representation, we conduct research in several directions: learning knowledge for understanding 3D point cloud data, estimating 3D object pose from 2D images, localizing actions in videos, and modeling noise-free signals from noisy raw signals.
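For the last of these directions, a minimal illustration of recovering a noise-free signal from noisy raw signals: with M independent, identically distributed noisy measurements of the same signal, simple averaging shrinks the noise standard deviation by a factor of sqrt(M). The group's actual models are of course far more sophisticated; this only makes the underlying statistical principle concrete.

```python
import numpy as np

def denoise_by_averaging(noisy_stack):
    """Estimate the underlying noise-free signal from M repeated noisy
    observations (shape (M, L)) by averaging. For i.i.d. zero-mean noise
    of std sigma, the residual noise std is sigma / sqrt(M)."""
    return np.asarray(noisy_stack).mean(axis=0)
```

With sigma = 0.5 and M = 400 observations, for example, the residual noise std drops to about 0.025, which is why even this naive estimator recovers the clean signal closely.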
Real-time decentralized video analytics on the edge
The impressive accuracy of deep neural networks (DNNs) has created great demand for practical analytics over video data. Although efficient and accurate, the latest video analytics systems do not support analytics beyond selection and aggregation queries. In data analytics, Top-K is a very important operation that enables analysts to focus on the most important entities. Everest is the first system that supports efficient and accurate Top-K video analytics. Everest ranks and identifies the most interesting frames/clips from videos with probabilistic guarantees. Furthermore, it supports user-defined functions that rank frames/clips by different semantics using different deep vision models. Everest combines techniques from computer vision, uncertain databases, and Top-K query processing to return results quickly.
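A simplified sketch of how a Top-K answer can carry a probabilistic guarantee, in the spirit of (but not identical to) Everest: suppose a cheap proxy model yields an uncertain score per frame, modeled here as an independent Gaussian with a mean and standard deviation. Monte Carlo sampling then estimates the probability that the candidate top-k set (the k frames with the highest mean scores) is the true top-k. The function name and the Gaussian noise model are our assumptions, not Everest's actual machinery.

```python
import numpy as np

def topk_confidence(means, stds, k, trials=10000, seed=0):
    """Estimate P(candidate top-k set is the true top-k) by Monte Carlo,
    given per-frame score means and stds (independent Gaussians).

    Returns (candidate_set, estimated_probability)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    cand = set(np.argsort(-means)[:k])
    hits = 0
    for _ in range(trials):
        # Sample one plausible "true" score vector and check whether it
        # induces the same top-k set as the candidate.
        sample = rng.normal(means, stds)
        if set(np.argsort(-sample)[:k]) == cand:
            hits += 1
    return cand, hits / trials
```

In a real system such an estimate drives the stopping rule: if the confidence is below the user's threshold, the ambiguous frames are re-scored with the expensive full vision model until the guarantee holds.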