Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, so that they can be synergistically exploited when learning novel skills. To accomplish this, we seek a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, like a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.
Different egocentric video tasks can provide different, and possibly complementary, perspectives on what the user is doing. For example, learning to recognise human actions can give clues as to which objects are being manipulated or what will happen next.
Different approaches can be used to learn from these tasks. Single-task models learn unique weights for each task. Multi-task learning shares a common backbone among the different tasks, with small task-specific heads on top. Task translation is a more recent approach that learns to translate the contributions of different tasks to solve one of them.
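As a rough illustration of the multi-task setting, the sketch below shows a shared backbone with small task-specific heads. The layer sizes, task names and output dimensions are hypothetical and only meant to convey the structure, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and task output sizes, chosen only for illustration.
FEAT_DIM, HIDDEN_DIM = 1024, 256


class MultiTaskModel(nn.Module):
    """A shared backbone with one lightweight head per task."""

    def __init__(self, task_output_dims):
        super().__init__()
        # Backbone shared by all tasks (e.g. a temporal encoder over clip features).
        self.backbone = nn.Sequential(
            nn.Linear(FEAT_DIM, HIDDEN_DIM), nn.ReLU(),
            nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
        )
        # Small task-specific heads on top of the shared representation.
        self.heads = nn.ModuleDict({
            task: nn.Linear(HIDDEN_DIM, out_dim)
            for task, out_dim in task_output_dims.items()
        })

    def forward(self, x):
        shared = self.backbone(x)
        return {task: head(shared) for task, head in self.heads.items()}


# One logits tensor per task for a batch of clip features.
model = MultiTaskModel({"AR": 10, "OSCC": 2, "PNR": 1})
outputs = model(torch.randn(8, FEAT_DIM))
```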
All these approaches have some limitations.
For example, multi-task learning shares weights across different tasks, but it does not explicitly model cross-task synergies and can lead to negative transfer between tasks.
Likewise, the cross-task translation mechanism proposed by EgoT2 combines perspectives from different tasks, but it needs to know all tasks beforehand and requires a separate model for each task.
We propose an approach for egocentric video understanding that focuses on knowledge reuse across different tasks. To do so, we adopt a shared graph-based model, with the goal of outperforming single-task and multi-task baselines adapted to our scenario.
Our approach is called EgoPack and is composed of two stages. In the first stage, a multi-task model is trained on a set of K known tasks. In the second stage, the model is fine-tuned on a novel task using EgoPack's cross-task interaction mechanism.
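The toy sketch below mirrors this two-stage recipe on random tensors, only to show the control flow: multi-task pre-training on the known tasks, followed by fine-tuning a head for the novel task. Freezing the backbone in the second stage, as well as all dimensions and class counts, are simplifying assumptions, and the cross-task interaction step is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for pre-extracted clip features and labels of two known tasks (AR, PNR)
# and a novel task (OSCC). Dimensions and class counts are illustrative assumptions.
features = torch.randn(64, 1024)
ar_labels = torch.randint(0, 10, (64,))
pnr_labels = torch.rand(64, 1)
oscc_labels = torch.randint(0, 2, (64,))

backbone = nn.Linear(1024, 256)      # stand-in for the shared temporal backbone
known_heads = nn.ModuleDict({"AR": nn.Linear(256, 10), "PNR": nn.Linear(256, 1)})

# Stage 1: multi-task pre-training on the K known tasks.
params = list(backbone.parameters()) + list(known_heads.parameters())
optim = torch.optim.Adam(params, lr=1e-3)
for _ in range(100):
    shared = backbone(features)
    loss = F.cross_entropy(known_heads["AR"](shared), ar_labels) \
         + F.mse_loss(known_heads["PNR"](shared), pnr_labels)
    optim.zero_grad()
    loss.backward()
    optim.step()

# Stage 2: fine-tune on the novel task (OSCC). Freezing the pre-trained parts is a
# simplification; in EgoPack this stage also queries the frozen task prototypes.
for p in params:
    p.requires_grad_(False)
novel_head = nn.Linear(256, 2)
optim = torch.optim.Adam(novel_head.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(novel_head(backbone(features)), oscc_labels)
    optim.zero_grad()
    loss.backward()
    optim.step()
```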
To share the same model across different tasks, we propose to model videos as graphs whose nodes correspond to temporal segments of the video and whose edges connect temporally close nodes, so that egocentric video tasks can be represented as different graph operations.
With this architecture, we can model temporal reasoning using a graph neural network that iteratively updates the node representations through message passing.
Finally, a set of task-specific heads project the nodes into the output space of each task.
These heads model different and complementary perspectives on the content of the video.
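A minimal sketch of this design is shown below, assuming pre-extracted segment features, a simple mean-aggregation message-passing layer and hypothetical output dimensions; the actual EgoPack architecture and hyper-parameters may differ.

```python
import torch
import torch.nn as nn


def temporal_edges(num_nodes, window=1):
    """Connect each temporal segment to its neighbours within a fixed window."""
    src, dst = [], []
    for i in range(num_nodes):
        for j in range(max(0, i - window), min(num_nodes, i + window + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    return torch.tensor([src, dst], dtype=torch.long)  # shape (2, num_edges)


class MessagePassingLayer(nn.Module):
    """One round of message passing: average neighbour features, then update."""

    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, edge_index):
        src, dst = edge_index
        agg = torch.zeros_like(x).index_add_(0, dst, x[src])   # sum of incoming messages
        deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
        agg = agg / deg.clamp(min=1)                            # mean aggregation
        return self.update(torch.cat([x, agg], dim=-1))


# Nodes are temporal segments of the video; edges connect temporally close segments.
segments = torch.randn(16, 256)                # 16 segments with 256-d features
edges = temporal_edges(16, window=2)
gnn = nn.ModuleList([MessagePassingLayer(256) for _ in range(2)])
heads = nn.ModuleDict({"AR": nn.Linear(256, 10), "OSCC": nn.Linear(256, 2)})

x = segments
for layer in gnn:
    x = layer(x, edges)                        # iteratively refine node features
outputs = {task: head(x) for task, head in heads.items()}
```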
We can collect these perspectives into a set of action-wise, task-specific prototypes. We call these prototypes a backpack of skills: they represent a frozen snapshot of what the model has learnt in the pre-training phase.
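One plausible way to build such prototypes is to average, for each action class and each task, the features of the nodes labelled with that class, as in the sketch below; exactly which features are averaged is an assumption here, not a detail taken from the paper.

```python
import torch


def build_prototypes(task_features, action_labels, num_actions):
    """Average node features per action class, separately for each task."""
    prototypes = {}
    for task, feats in task_features.items():
        protos = torch.zeros(num_actions, feats.size(1))
        counts = torch.zeros(num_actions, 1)
        protos.index_add_(0, action_labels, feats)
        counts.index_add_(0, action_labels, torch.ones(feats.size(0), 1))
        prototypes[task] = protos / counts.clamp(min=1)
    return prototypes


# Random stand-ins for the known tasks' node features and a hypothetical
# vocabulary of 20 action classes.
feats = {"AR": torch.randn(100, 256), "OSCC": torch.randn(100, 256), "PNR": torch.randn(100, 256)}
labels = torch.randint(0, 20, (100,))
# The "backpack": a frozen snapshot that is never updated afterwards.
backpack = {t: p.detach() for t, p in build_prototypes(feats, labels, 20).items()}
```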
We validate EgoPack on Action Recognition (AR), Object State Change Classification (OSCC), Point of No Return (PNR) and Long Term Anticipation (LTA) from Ego4D.
How much complementary information do the different perspectives bring?
When the novel task is OSCC, what are the closest prototypes from the AR and PNR tasks?
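As an illustrative probe of this question, one could compare a novel-task node against the frozen prototypes of the other tasks with cosine similarity, as sketched below; the actual cross-task interaction in EgoPack is learned, and the shapes and vocabularies here are hypothetical.

```python
import torch
import torch.nn.functional as F


def closest_prototypes(node_feat, backpack, k=3):
    """Return, for each known task, the indices of the k most similar prototypes."""
    out = {}
    for task, protos in backpack.items():
        sims = F.cosine_similarity(node_feat.unsqueeze(0), protos, dim=-1)
        out[task] = sims.topk(k).indices.tolist()
    return out


# Example: a node from the novel OSCC task queried against AR and PNR prototypes.
backpack = {"AR": torch.randn(20, 256), "PNR": torch.randn(20, 256)}
node = torch.randn(256)
print(closest_prototypes(node, backpack))      # e.g. {'AR': [...], 'PNR': [...]}
```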
@InProceedings{peirone2024backpack,
author = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Averta, Giuseppe},
title = {A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {18275-18285}
}