Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, all at once. To endow autonomous systems with such a holistic perception, it is essential to learn how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, which is essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, thereby expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning, equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.
Different egocentric video tasks can provide different, and possibly complementary, perspectives on what the user is doing. For example, learning to recognise human actions can give clues as to which objects are being manipulated or what will happen next.
Different approaches can be used to learn from these tasks. Single-task models learn unique weights for each task. Multi-task learning shares a common backbone among the different tasks, with small task-specific heads on top. Task translation is a more recent approach that learns to translate the contributions of the different tasks to solve one of them.
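To make the distinction concrete, the minimal PyTorch sketch below shows a multi-task model with a shared backbone and small task-specific heads; the names, feature sizes and output dimensions are purely illustrative and not taken from our setup.

import torch
import torch.nn as nn

feat_dim, hidden_dim = 1024, 256              # assumed clip-feature / hidden sizes
task_out_dims = {"AR": 110, "OSCC": 2}        # hypothetical per-task output sizes

class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One backbone shared by all tasks...
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        # ...and a small task-specific head on top for each task.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden_dim, d) for t, d in task_out_dims.items()})

    def forward(self, x, task):
        return self.heads[task](self.backbone(x))

model = MultiTaskModel()
logits = model(torch.randn(8, feat_dim), task="AR")   # shape: (8, 110)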
All these approaches have some limitations. For example, multi-task learning can share weights across different tasks, but it does not explicitly model cross-task synergies and can lead to negative transfer between tasks. Likewise, the cross-task translation mechanism proposed by EgoT2 combines perspectives from different tasks, but it needs to know all tasks beforehand and requires separate models for each task.
We propose an approach for egocentric video understanding that focuses on knowledge reuse across different tasks. To do so, we adopt a shared graph-based model, with the goal of outperforming single-task and multi-task baselines adapted to our scenario.
Our approach is called Hier-EgoPack and is composed of two stages. In the first stage, a multi-task model is trained on a set of K known tasks. In the second stage, the model is fine-tuned on a novel task using Hier-EgoPack’s cross-task interaction mechanism.
To share the same model across different tasks, we propose to model videos as graphs: nodes correspond to temporal segments of the video, edges connect temporally close nodes, and egocentric video tasks can be represented as different graph operations.
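As a rough illustration (with hypothetical segment counts, feature sizes and temporal window, not the actual values used in the paper), the snippet below builds such a temporal graph by connecting segments whose timestamps are close.

import torch

num_segments, feat_dim, w = 16, 1024, 2              # hypothetical values
node_feats = torch.randn(num_segments, feat_dim)     # one feature vector per temporal segment
timestamps = torch.arange(num_segments, dtype=torch.float)   # segment centres

# Connect pairs of nodes that are temporally close (|t_i - t_j| <= w, i != j).
dist = (timestamps[:, None] - timestamps[None, :]).abs()
adj = (dist <= w) & (dist > 0)
edge_index = adj.nonzero().t()                        # (2, num_edges), COO format

print(edge_index.shape)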
In this framework, we can implement temporal reasoning as a hierarchical graph neural network that learns progressively coarsened representations of the input video.
Finally, a set of task-specific heads projects the nodes into the output space of each task.
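The sketch below gives a simplified, plain-PyTorch picture of this pipeline: a naive mean-aggregation layer and pairwise temporal pooling stand in for the paper's GNN layer and coarsening strategy, and all names and sizes are illustrative.

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Mean aggregation over temporal neighbours within a window, followed by an MLP."""
    def __init__(self, dim, window=2):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x):                                  # x: (num_nodes, dim)
        t = torch.arange(x.size(0), dtype=torch.float)
        adj = ((t[:, None] - t[None, :]).abs() <= self.window).float()
        adj = adj / adj.sum(dim=1, keepdim=True)           # row-normalised adjacency
        neigh = adj @ x                                     # mean over temporal neighbours
        return self.mlp(torch.cat([x, neigh], dim=-1))

class HierarchicalBackbone(nn.Module):
    def __init__(self, dim, num_levels=3):
        super().__init__()
        self.layers = nn.ModuleList([SimpleGNNLayer(dim) for _ in range(num_levels)])

    def forward(self, x):
        feats = []                                          # one tensor per temporal granularity
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
            # Coarsen: average pairs of consecutive nodes to halve the temporal resolution.
            if x.size(0) > 1:
                n = x.size(0) - x.size(0) % 2
                x = x[:n].view(-1, 2, x.size(-1)).mean(dim=1)
        return feats

dim = 256
backbone = HierarchicalBackbone(dim)
heads = nn.ModuleDict({"AR": nn.Linear(dim, 110), "OSCC": nn.Linear(dim, 2)})   # hypothetical heads
levels = backbone(torch.randn(16, dim))                     # list of progressively coarser node features
ar_logits = heads["AR"](levels[0])                          # fine-grained, segment-level predictions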
These heads model different and complementary perspectives on the content of the video. We can collect these perspectives in a set of action-wise, task-specific prototypes. We call these prototypes a backpack of skills: they represent a frozen snapshot of what the model has learnt in the pre-training phase.
We validate Hier-EgoPack on five Ego4D benchmarks: Action Recognition (AR), Long Term Anticipation (LTA), Object State Change Classification (OSCC), Point of No Return (PNR) and Moment Queries (MQ).
We focus on the Top-20 most activated prototypes across the support tasks. LTA and OSCC have more uniform activations across the different support tasks, i.e., they look at similar prototypes, while MQ exhibits more diverse activations.
Activation consensus between two support tasks is defined as the percentage of prototypes activated by the two tasks that correspond to the same label. Fine-grained tasks, i.e., AR, OSCC and LTA, have higher average consensus. In contrast, MQ has lower average consensus.
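The toy snippet below sketches one possible reading of these two quantities, using random activation scores and assumed prototype labels rather than the actual model outputs:

import torch

num_prototypes, k = 100, 20
proto_labels = torch.arange(num_prototypes) % 10      # assumed action label of each prototype
act_a = torch.rand(num_prototypes)                    # activations from support task A
act_b = torch.rand(num_prototypes)                    # activations from support task B

top_a = proto_labels[act_a.topk(k).indices]           # labels of A's Top-20 prototypes
top_b = proto_labels[act_b.topk(k).indices]           # labels of B's Top-20 prototypes

shared = set(top_a.tolist()) & set(top_b.tolist())    # labels activated by both tasks
consensus = sum(l in shared for l in top_a.tolist()) / k
print(f"activation consensus: {consensus:.2f}")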
@article{peirone2025backpack,
  title   = {Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives},
  author  = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Tommasi, Tatiana and Averta, Giuseppe},
  journal = {arXiv preprint arXiv:2502.02487},
  year    = {2025}
}

@inproceedings{peirone2024backpack,
  title     = {A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives},
  author    = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Averta, Giuseppe},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {18275-18285}
}