Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability has an underlying structure, composed of a hierarchy of patterns of related actions. We argue that this structure can emerge naturally from unscripted videos of human activities and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic, and temporal reasoning with a hierarchical architecture. We demonstrate the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot on procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance on all benchmarks, and on procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results show the relevance of exploiting knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
Human activities are complex, hierarchical, and goal-oriented. We investigate how different video understanding models encode such hierarchy in their feature space: we take a video, extract features from dense segments, and compute the pairwise feature similarity over the entire video.
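For reference, this probing can be sketched in a few lines of PyTorch. This is a minimal illustration, not the official HiERO code; the segment count and feature dimension are placeholders and not tied to any specific backbone.

import torch
import torch.nn.functional as F

def pairwise_similarity(segment_features: torch.Tensor) -> torch.Tensor:
    # segment_features: (num_segments, feature_dim), from any video backbone
    feats = F.normalize(segment_features, dim=-1)  # unit-normalize each segment feature
    return feats @ feats.T                         # (num_segments, num_segments) cosine similarities

# Example: 256 dense segments with 768-dimensional features (hypothetical sizes).
sim = pairwise_similarity(torch.randn(256, 768))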
We design HiERO based on the intuition that, given a sufficiently large collection of videos capturing human activities in the wild, functional dependencies between actions naturally emerge as frequently co-occurring patterns directly from observations. As a result, HiERO's feature space allows related actions to be easily grouped into high-level patterns with a simple clustering operation.
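The sketch below illustrates one such "simple clustering operation" on segment features, assuming a HiERO-like feature space. Spectral clustering on the cosine-similarity graph is a reasonable choice for this illustration, not necessarily the exact operation used by HiERO, and the number of threads is a hypothetical parameter.

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(features: np.ndarray, num_threads: int) -> np.ndarray:
    # features: (num_segments, feature_dim); returns one cluster id per segment
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    affinity = np.clip(feats @ feats.T, 0.0, 1.0)  # non-negative similarity graph
    labels = SpectralClustering(
        n_clusters=num_threads, affinity="precomputed", assign_labels="kmeans"
    ).fit_predict(affinity)
    return labels

# Segments sharing a cluster id belong to the same high-level activity thread.
thread_ids = cluster_segments(np.random.randn(256, 768).astype(np.float32), num_threads=8)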
We evaluate HiERO on EgoMCQ and EgoNLQ to validate its video-text alignment capabilities and to show that reasoning on functional threads at different scales can support various video understanding tasks. Together, video-narration alignment and functional clustering are effective at discriminating between similar short-term actions, which is critical for EgoMCQ, and at capturing long-range causal and temporal dependencies in the video, which is essential for EgoNLQ.
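For intuition, multiple-choice video-text alignment as in EgoMCQ reduces to ranking candidate clips by their similarity with the text query. The sketch below assumes clip and narration features already live in a shared embedding space; names and dimensions are illustrative, not taken from the HiERO codebase.

import torch
import torch.nn.functional as F

def answer_mcq(text_feature: torch.Tensor, clip_features: torch.Tensor) -> int:
    # text_feature: (feature_dim,); clip_features: (num_candidates, feature_dim)
    t = F.normalize(text_feature, dim=-1)
    c = F.normalize(clip_features, dim=-1)
    scores = c @ t                  # cosine similarity of each candidate clip with the query
    return int(scores.argmax())     # index of the predicted clip

# Example with 5 candidate clips and 768-dimensional joint embeddings (hypothetical sizes).
pred = answer_mcq(torch.randn(768), torch.randn(5, 768))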
We evaluate HiERO on EgoProceL in zero-shot, using visual features extracted from the Omnivore and EgoVLP backbones. On this benchmark we assess the ability of HiERO to group together parts of the video that correspond to the same high-level activity by leveraging their functional similarity, even though it was not explicitly trained to identify procedure steps inside a video.
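For completeness, the sketch below shows one common way zero-shot cluster assignments can be matched to ground-truth key-steps before computing F1: Hungarian matching between predicted clusters and annotated steps. This is a standard protocol for unsupervised procedure learning, not necessarily the exact evaluation code used for EgoProceL.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters_to_steps(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    # pred, gt: per-segment integer labels of the same length; returns remapped predictions
    num_pred, num_gt = pred.max() + 1, gt.max() + 1
    overlap = np.zeros((num_pred, num_gt))
    for p, g in zip(pred, gt):
        overlap[p, g] += 1
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = {r: c for r, c in zip(rows, cols)}
    return np.array([mapping.get(p, -1) for p in pred])  # -1 marks unmatched clusters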
We also evaluate HiERO on the Step Grounding and Step Localization tasks from Ego4D Goal-Step, both in zero-shot and supervised settings. Compared to supervised approaches that learn a direct mapping between the video and the procedure steps, we argue that the steps detected by HiERO emerge as a composition of low-level patterns that are clustered together.
@article{peirone2025hiero,
  title={HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos},
  author={Peirone, Simone Alberto and Pistilli, Francesca and Averta, Giuseppe},
  journal={arXiv preprint arXiv:2505.12911},
  year={2025}
}