Research

Event Understanding

Video captures a tremendous amount of information, encompassing both visual and semantic knowledge. Traditional approaches to video activity understanding are based on training machine learning models or, more recently, a variety of deep learning models to capture the underlying semantics of the video using human-annotated training data. However, this restricts the trained models to the ontology given by the annotations. A deeper understanding of video activities extends beyond recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data.

Projects


Pattern Theory-based and Commonsense Knowledge for Event Interpretation

Title: Pattern Theory-based and Commonsense Knowledge for Event Interpretation

Description: We propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues for establishing semantic relationships among entities hypothesized directly from the video signal. We formulate this mathematically in the language of Grenander's canonical pattern generator theory. We show that prior encoded commonsense knowledge alleviates the need for large annotated training datasets and helps tackle imbalance in the training data. Through extensive experiments, we show that commonsense knowledge from ConceptNet allows the proposed approach to handle challenges such as training data imbalance, weak features, and complex semantic relationships and visual scenes. We also find that the use of commonsense knowledge yields highly interpretable models that can be used in a dialog system for better human-machine interaction. An illustrative sketch of the energy formulation follows.
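
To make the idea concrete, the Python sketch below shows how commonsense relatedness scores could act as bond weights in a simple energy function over hypothesized concepts, with a brute-force search standing in for the actual inference procedure. This is an illustration only: the detector confidences, the relatedness table (a stand-in for real ConceptNet queries), and the weights alpha and beta are assumed values, not the published formulation.

    # Illustrative sketch: energy of an interpretation built from hypothesized
    # generators (concepts detected in video) connected by bonds whose support
    # comes from a commonsense knowledge source such as ConceptNet.
    # The relatedness table below is a stand-in for actual ConceptNet queries.

    from itertools import combinations

    # Detector confidences for hypothesized concepts (assumed values)
    feature_support = {"person": 0.9, "knife": 0.6, "cut": 0.5, "tomato": 0.7}

    # Commonsense relatedness between concept pairs (stand-in for ConceptNet)
    commonsense = {
        ("knife", "cut"): 0.8,
        ("cut", "tomato"): 0.7,
        ("person", "cut"): 0.5,
    }

    def energy(interpretation, alpha=1.0, beta=1.0):
        """Lower energy = better interpretation. Combines bottom-up detector
        support with top-down commonsense bond support."""
        e = -alpha * sum(feature_support[g] for g in interpretation)
        for a, b in combinations(sorted(interpretation), 2):
            e -= beta * commonsense.get((a, b), commonsense.get((b, a), 0.0))
        return e

    # Toy inference: exhaustively score subsets of hypothesized generators
    candidates = list(feature_support)
    best = min(
        (set(c) for r in range(1, len(candidates) + 1)
         for c in combinations(candidates, r)),
        key=energy,
    )
    print(best, energy(best))

In this toy setting the lowest-energy interpretation keeps concepts that are both well supported by the detectors and mutually related in the commonsense source, which is the intuition behind using ConceptNet as a contextual prior.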

Publications

Self Supervised Event Segmentation

Title: Self Supervised Event Segmentation

Description: Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this work, we tackle the problem of self-supervised temporal segmentation, which alleviates the need for any supervision in the form of labels (full supervision) or temporal ordering (weak supervision). We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into constituent events. Learning involves only a single pass through the training data. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on publicly available datasets show the efficacy of the proposed approach: it outperforms weakly supervised and unsupervised baselines and achieves segmentation results competitive with fully supervised baselines, using only a single pass through the training data. Finally, we show that the proposed self-supervised learning paradigm learns highly discriminative features that improve action recognition. A rough sketch of the predictive-learning idea follows.
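
As a rough illustration of the predictive-learning idea, the Python sketch below trains a recurrent predictor online in a single pass and hypothesizes an event boundary whenever the prediction error spikes above its running average. The GRU predictor, the gating threshold, and the random stand-in features are assumptions for illustration, not the exact architecture or boundary rule from the paper.

    # Illustrative sketch: online, single-pass predictive segmentation.
    # A recurrent predictor forecasts the next frame feature; boundaries are
    # hypothesized where prediction error spikes above its running average
    # (the gating rule and constants are assumptions, not the paper's own).

    import torch
    import torch.nn as nn

    feat_dim, hid_dim = 512, 256
    frame_feats = torch.randn(1000, feat_dim)  # stand-in for per-frame features

    gru = nn.GRUCell(feat_dim, hid_dim)
    head = nn.Linear(hid_dim, feat_dim)
    opt = torch.optim.Adam(list(gru.parameters()) + list(head.parameters()), lr=1e-4)

    h = torch.zeros(1, hid_dim)
    running_err, momentum, gate = 0.0, 0.99, 1.5
    boundaries = []

    for t in range(len(frame_feats) - 1):
        x_t = frame_feats[t].unsqueeze(0)
        x_next = frame_feats[t + 1].unsqueeze(0)

        h = gru(x_t, h)
        pred = head(h)
        err = ((pred - x_next) ** 2).mean()

        # Boundary when the error jumps well above its recent average
        # (skip a short warm-up while the running average stabilizes)
        if t > 10 and err.item() > gate * running_err:
            boundaries.append(t + 1)
            h = h.detach() * 0.0      # reset context at a hypothesized boundary
        running_err = momentum * running_err + (1 - momentum) * err.item()

        opt.zero_grad()
        err.backward()
        opt.step()
        h = h.detach()                # truncate backprop for online learning

    print("hypothesized event boundaries:", boundaries[:10])

The single online loop mirrors the single-pass training described above; detaching the hidden state keeps the update local to each step, which is one simple way to keep online recurrent learning tractable.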

Publications