Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

Politecnico di Torino  
simone.peirone@polito.it
arXiv (Feb. 2025)

Abstract

Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, all at once. To endow autonomous systems with such a holistic perception, it is essential to learn how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, which is essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

What can we learn from a video?

Different egocentric video tasks can provide different, and possibly complementary, perspectives on what the user is doing. For example, learning to recognise human actions can give clues as to which objects are being manipulated or what will happen next.

How can we learn from these perspectives?

Different approaches can be used to learn from these tasks. Single-task models learn unique weights for each task. Multi-task learning shares a common backbone among the different tasks, with small task-specific heads on top. Task translation is a more recent approach that learns to translate the contributions of different tasks to solve one of them.

All these approaches have some limitations. For example, multi-task learning shares weights across different tasks, but it does not explicitly model cross-task synergies and can lead to negative transfer between tasks.
Likewise, the cross-task translation mechanism proposed by EgoT2 combines perspectives from different tasks, but it needs to know all tasks beforehand and requires separate models for each task.


A new paradigm for Egocentric Video Understanding

We propose an approach for egocentric video understanding that focuses on knowledge reuse across different tasks. To do so, we adopt a shared graph-based model, with the goal of outperforming single-task and multi-task baselines adapted to our scenario.

Proposed Architecture

Our approach is called Hier-EgoPack and is composed of two stages. In the first stage, a multi-task model is trained on a set of K known tasks. In the second stage, the model is fine-tuned on a novel task using Hier-EgoPack’s cross-task interaction mechanism.

Step 1: MTL Pre-training

To share the same model across different tasks, we propose to model videos as graphs: nodes correspond to temporal segments of the video, edges connect temporally close nodes, and egocentric video tasks are represented as different operations on the graph.
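As a rough illustration of this graph construction, the snippet below builds a simple temporal graph in which each segment node is connected to the nodes within a fixed temporal window. The helper name and window size are illustrative, not the exact configuration used in Hier-EgoPack.

```python
import torch

def build_temporal_graph(num_nodes: int, window: int = 2) -> torch.Tensor:
    """Connect each temporal segment (node) to its neighbours within
    `window` steps, producing a (2, E) edge index tensor.

    Hypothetical helper for illustration only; the actual graph
    construction in Hier-EgoPack may differ.
    """
    src, dst = [], []
    for i in range(num_nodes):
        for j in range(max(0, i - window), min(num_nodes, i + window + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    return torch.tensor([src, dst], dtype=torch.long)

# Example: 8 one-second segments, each described by a 256-d feature vector.
x = torch.randn(8, 256)                      # node features (one per segment)
edge_index = build_temporal_graph(8, window=2)
```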



In this framework, we can implement temporal reasoning as a hierarchical graph neural network that learns progressively coarsened representations of the input video.
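The sketch below conveys the idea of progressive temporal coarsening, assuming a plain PyTorch stand-in for the GNN layers: node features are mixed with their temporal neighbours and then pooled pairwise to obtain coarser granularities. The real architecture uses the TDGC layer described later rather than these simplified operations.

```python
import torch
import torch.nn as nn

class HierTemporalGNNSketch(nn.Module):
    """Minimal sketch of hierarchical temporal reasoning: alternate a
    neighbourhood-averaging step (a stand-in for message passing) with
    temporal pooling that halves the number of nodes, yielding
    progressively coarser views of the video."""

    def __init__(self, dim: int, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor):
        # x: (T, D) node features ordered by time.
        pyramid = []
        for layer in self.layers:
            # Mix each node with its temporal neighbours (message-passing stand-in).
            left = torch.roll(x, 1, dims=0)
            right = torch.roll(x, -1, dims=0)
            x = torch.relu(layer((x + left + right) / 3))
            pyramid.append(x)
            # Temporal coarsening: average-pool pairs of consecutive nodes.
            if x.shape[0] > 1:
                t = x.shape[0] - (x.shape[0] % 2)
                x = x[:t].reshape(-1, 2, x.shape[-1]).mean(dim=1)
        return pyramid  # one set of node features per temporal granularity

features = torch.randn(16, 256)
levels = HierTemporalGNNSketch(256)(features)
print([lvl.shape for lvl in levels])  # e.g. [(16, 256), (8, 256), (4, 256)]
```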



Finally, a set of task-specific heads projects the nodes into the output space of each task.

Temporal Distance Gated Convolution (TDGC)
At the core of Hier-EgoPack is Temporal Distance Gated Convolution (TDGC), a novel GNN layer for egocentric vision tasks that require a strong sense of time, i.e., the ability to reason effectively about the order of events in a video.
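A minimal sketch of the intuition behind TDGC, assuming a simple sigmoid gate over the signed temporal distance between connected nodes (the exact formulation in the paper may differ):

```python
import torch
import torch.nn as nn

class TDGCSketch(nn.Module):
    """Hedged sketch of a temporal-distance-gated convolution: each
    neighbour's message is modulated by a gate computed from the signed
    temporal distance between the two nodes, so the layer stays aware of
    whether a neighbour lies before or after the current segment."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(1, dim), nn.Sigmoid())

    def forward(self, x, timestamps, edge_index):
        # x: (N, D) node features, timestamps: (N,) floats, edge_index: (2, E).
        src, dst = edge_index
        # Signed temporal distance keeps "before" vs "after" distinguishable.
        delta = (timestamps[src] - timestamps[dst]).unsqueeze(-1)
        messages = self.msg(x[src]) * self.gate(delta)
        out = torch.zeros_like(x)
        out.index_add_(0, dst, messages)                 # sum messages per node
        deg = torch.zeros(x.shape[0], device=x.device)
        deg.index_add_(0, dst, torch.ones_like(dst, dtype=x.dtype))
        return out / deg.clamp(min=1).unsqueeze(-1)      # mean aggregation

# Toy usage: 4 nodes on a small temporal chain.
x = torch.randn(4, 256)
t = torch.arange(4, dtype=torch.float32)                 # one timestamp per node
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
out = TDGCSketch(256)(x, t, edge_index)
```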

Step 2: Novel Task Learning

The task-specific heads learnt in the pre-training stage model different and complementary perspectives on the content of the video. We collect these perspectives in a set of action-wise, task-specific prototypes.

We call these prototypes a backpack of skills and they represent a frozen snapshot of what the model has learnt in the pre-training phase.

  1. When learning a novel task, we feed the temporal features through the task-specific heads of the pre-training tasks.

  2. These features act as queries to look for the closest matching prototypes using k-NN in the feature space.

  3. We refine the task features using message passing with the retrieved task prototypes, as sketched below.
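A compact sketch of steps 1-3, assuming cosine-similarity k-NN retrieval and a simple mean aggregation as a stand-in for the message-passing refinement (names and fusion details are illustrative):

```python
import torch
import torch.nn.functional as F

def refine_with_prototypes(task_feats, prototypes, k=5):
    """Hedged sketch of the cross-task interaction: for each novel-task
    feature, retrieve its k nearest prototypes from a support task and mix
    their (frozen) knowledge back in. The mean aggregation below is a
    stand-in for the message-passing step used in the paper.

    task_feats: (N, D) novel-task features after the temporal backbone.
    prototypes: (P, D) frozen action-wise prototypes of one support task.
    """
    # Cosine-similarity k-NN in the feature space.
    q = F.normalize(task_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    sim = q @ p.t()                              # (N, P)
    topk = sim.topk(k, dim=-1).indices           # (N, k)
    neighbours = prototypes[topk]                # (N, k, D)
    # Fuse the retrieved perspectives with the original features.
    return task_feats + neighbours.mean(dim=1)

feats = torch.randn(32, 256)      # novel-task node features
protos = torch.randn(100, 256)    # frozen prototypes from a support task
refined = refine_with_prototypes(feats, protos, k=5)
```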

Experimental results

We validate Hier-EgoPack on five Ego4D benchmarks: Action Recognition (AR), Long Term Anticipation (LTA), Object State Change Classification (OSCC), Point of No Return (PNR) and Moment Queries (MQ).

Main results on Ego4D benchmarks

Activation frequency for the task-specific prototypes from different support tasks

We focus on the Top-20 most activated prototypes across the support tasks. LTA and OSCC have more uniform activations across the different support tasks, i.e., they look at similar prototypes, while MQ exhibits more diverse activations.

Activation consensus for different novel tasks

Activation consensus between two support tasks is defined as the percentage of their prototypes corresponding to the same label that are activated by both tasks. Fine-grained tasks, i.e., AR, OSCC and LTA, have higher average consensus. In contrast, MQ has lower average consensus.
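As a toy illustration of this metric, assuming prototypes are identified by their action label and activations are given as index sets (the paper's exact computation may differ):

```python
def activation_consensus(activated_a, activated_b, labels_a, labels_b):
    """Hedged sketch of the consensus metric: given the indices of the
    prototypes activated by two support tasks and the action label of each
    prototype, measure the share of activated labels that both tasks agree on."""
    labels_from_a = {labels_a[i] for i in activated_a}
    labels_from_b = {labels_b[i] for i in activated_b}
    shared = labels_from_a & labels_from_b
    total = labels_from_a | labels_from_b
    return 100.0 * len(shared) / max(len(total), 1)

# Toy example: prototype labels per support task and the activated indices.
labels_a = ["cut", "pour", "open", "close"]
labels_b = ["cut", "stir", "open", "wash"]
print(activation_consensus({0, 2}, {0, 1, 2}, labels_a, labels_b))  # ~66.7
```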

Cite us

@article{peirone2025backpack,
  title   = {Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives},
  author  = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Tommasi, Tatiana and Averta, Giuseppe},
  journal = {arXiv preprint arXiv:2502.02487},
  year    = {2025}
}

Please consider also citing our original CVPR publication:
@InProceedings{peirone2024backpack,
  author    = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Averta, Giuseppe},
  title     = {A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {18275-18285}
}