A Backpack Full of Skills:
Egocentric Video Understanding with Diverse Task Perspectives

¹Politecnico di Torino, ²Istituto Italiano di Tecnologia
firstname.lastname@polito.it
Computer Vision and Pattern Recognition (CVPR) 2024

Given a video stream, a robot is asked to learn a novel task, e.g. Object State Change Classification (OSCC). To learn the new skill, the robot can access previously gained knowledge about different tasks, such as Point of No Return (PNR), Long Term Anticipation (LTA) and Action Recognition (AR), and use it during the learning process to enhance downstream task performance. This knowledge is stored as graphs inside the robot's backpack, always ready to boost a new skill.

Abstract

Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, everything all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, to synergistically exploit them when learning novel skills. To accomplish this, we seek a unified approach to video understanding that combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, like a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.

Alan, a skilled marmot with a backpack full of skills, ready for the next adventure (Credits: DALL-E 3).

Cite us

@misc{peirone2024backpack,
    title={A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives}, 
    author={Simone Alberto Peirone and Francesca Pistilli and Antonio Alliegro and Giuseppe Averta},
    year={2024},
    eprint={2403.03037},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}