HiERO:
understanding the hierarchy of human behavior enhances reasoning on egocentric videos

Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta
Politecnico di Torino
simone.peirone@polito.it
Zero-Shot procedure step localization with HiERO. Given a long video, HiERO computes segment-level features that encode the functional dependencies between its actions at different scales, enabling the detection of procedure steps through a simple clustering in feature space.

Abstract

Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot on procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance on all benchmarks, and on procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.


💡 How are human actions modeled by video models?

Human activities are complex, hierarchical and goal-oriented. We question how different video understanding models encode such hierarchy in their feature space. We take a video, extract features from dense temporal segments, and then compute the pairwise feature similarity over the entire video.

Emergence of step clusters in the feature similarity matrix of a video from Ego4D.
Colored rectangles indicate the ground-truth steps. Ideally, we expect high similarity when two segments represent semantically similar steps. With Omnivore features, this behavior is only partially visible. With EgoVLP features, we observe sharper clusters of temporal segments that are not necessarily close in time, but represent similar high-level actions. Our approach makes this behavior even more evident.
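For readers who want to reproduce this probing on their own videos, a minimal sketch follows. The feature tensor below is a placeholder standing in for segment-level embeddings from any backbone (e.g., Omnivore or EgoVLP); the actual feature extraction pipeline is not shown.

import torch
import torch.nn.functional as F

# Placeholder segment-level features for a long video: one embedding per
# fixed-length segment, shape (num_segments, feature_dim).
segment_features = torch.randn(120, 768)

# L2-normalize so that the dot product equals cosine similarity.
normed = F.normalize(segment_features, dim=-1)

# Pairwise similarity matrix (num_segments x num_segments), the same quantity
# visualized in the figure above.
similarity = normed @ normed.T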

🔧 Learning high-level human activities without supervision

We design HiERO based on the intuition that, given a sufficiently large collection of videos capturing human activities in the wild, functional dependencies between actions naturally emerge as frequently co-occurring patterns directly from the observations. As a result, HiERO's feature space allows related actions to be easily grouped into high-level patterns with a simple clustering operation.
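As a rough illustration of this zero-shot grouping, the sketch below clusters precomputed segment features with k-means and merges consecutive segments sharing the same cluster id into step proposals. The features, the number of clusters K, and the merging heuristic are placeholder assumptions, not the exact recipe used by HiERO.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder segment-level features, e.g. taken from HiERO's decoder.
# Shape: (num_segments, feature_dim).
features = np.random.randn(120, 768).astype(np.float32)

# Group segments into a fixed number of putative procedure steps.
# K is a free parameter here, not the paper's exact setting.
K = 8
labels = KMeans(n_clusters=K, n_init=10).fit_predict(features)

# Merge consecutive segments with the same cluster id into step proposals.
steps = []
start = 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        steps.append((start, i - 1, int(labels[start])))  # (first_seg, last_seg, cluster)
        start = i
print(steps)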


HiERO is built on two objectives:

Video-Narrations Alignment
Align segments of a video with their corresponding narrations to learn patterns of related actions (a minimal sketch of such a contrastive objective follows this list).
Functional Threads Clustering
Group together actions that are functionally similar (and make them even more similar using contrastive alignment).
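Below is a minimal sketch of the first objective: a generic symmetric InfoNCE-style contrastive loss between paired segment and narration embeddings. The function name, temperature value, and batching are illustrative assumptions; HiERO's exact formulation may differ.

import torch
import torch.nn.functional as F

def video_narration_alignment_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (N, D) embeddings of N matching segment/narration pairs.
    # Generic symmetric contrastive alignment, not necessarily HiERO's exact loss.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video -> narration
    loss_t2v = F.cross_entropy(logits.T, targets)  # narration -> video
    return 0.5 * (loss_v2t + loss_t2v)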
The HiERO architecture
Architecture of HiERO
The Temporal Encoder performs temporal reasoning on graph representations of the input video at different scales, while the Function-Aware Decoder recombines nodes in the video graph by matching segments that represent functional dependencies between actions. HiERO is trained to align video segments with their corresponding textual narrations at the shallowest layer, and to strengthen thread-aware clustering in the deeper layers.
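The sketch below illustrates only the coarse layout described above (multi-scale temporal reasoning producing a feature pyramid); a plain 1D convolution stands in for the paper's graph-based operators and the function-aware decoder is omitted, so every module here is an illustrative assumption rather than HiERO's actual implementation.

import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    # One encoder stage: local temporal mixing followed by 2x temporal
    # downsampling. A 1D convolution stands in for graph-based message passing.
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, x):  # x: (batch, dim, num_segments)
        return self.pool(torch.relu(self.mix(x)))

class HiEROSkeleton(nn.Module):
    # Coarse sketch: a multi-scale temporal encoder producing a feature pyramid.
    # The decoder that recombines functionally similar segments is omitted.
    def __init__(self, dim=768, num_scales=3):
        super().__init__()
        self.encoder = nn.ModuleList(
            [TemporalEncoderBlock(dim) for _ in range(num_scales)]
        )

    def forward(self, segments):  # segments: (batch, num_segments, dim)
        x = segments.transpose(1, 2)
        pyramid = []
        for block in self.encoder:
            x = block(x)
            pyramid.append(x.transpose(1, 2))  # features at each temporal scale
        return pyramid

Calling HiEROSkeleton()(torch.randn(1, 128, 768)) returns features at three progressively coarser temporal scales, i.e. the kind of multi-scale representation on which the decoder and the clustering above would operate.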

🔥 HiERO is good at video-language alignment...

We evaluate HiERO on EgoMCQ and EgoNLQ to validate its video-text alignment capabilities and to show that reasoning on functional threads at different scales can support a variety of video understanding tasks. Together, video-narrations alignment and functional clustering are effective at discriminating between similar short-term actions, which is critical for EgoMCQ, and at capturing long-range causal and temporal dependencies in the video, which is essential for EgoNLQ.

HiERO results on EgoMCQ and EgoNLQ.

🚀 ...and excels at (zero-shot) procedure learning!

We evaluate HiERO on EgoProceL in zero-shot, using visual features extracted from the Omnivore and EgoVLP backbones. On this benchmark, we assess HiERO's ability to group together the parts of a video that correspond to the same high-level activity by leveraging their functional similarity, even though it was never explicitly trained to identify procedure steps within a video.

HiERO results on Procedure Learning on EgoProceL.

We also evaluate HiERO on the Step Grounding and Step Localization tasks from Ego4D Goal-Step, both in zero-shot and supervised settings. In contrast to supervised approaches that learn a direct mapping between the video and the procedure steps, we argue that the steps detected by HiERO emerge as compositions of low-level patterns that are clustered together.

Cite us

@article{peirone2025hiero,
  title={HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos},
  author={Peirone, Simone Alberto and Pistilli, Francesca and Averta, Giuseppe},
  journal={arXiv preprint arXiv:2505.12911},
  year={2025}
}