Deep multi-scale pyramidal features network for supervised video summarization

Habib Khan, T. Hussain, Samee Ullah Khan, Zulfiqar Ahmad Khan, Sung Wook Baik*

*Corresponding author for this work

Research output: Contribution to journal › Article (journal) › peer-review

41 Citations (Scopus)
9 Downloads (Pure)

Abstract

Video data are growing exponentially, and extracting summarized information from them is challenging. Reducing the load of video traffic is necessary for efficient video storage, transmission, and retrieval. The aim of video summarization (VS) is to extract the most important contents from video repositories effectively. Recent attempts have used fewer representative features, which have been fed to recurrent networks to achieve VS. However, generating the desired summaries can become challenging due to the limited representativeness of the extracted features and a lack of consideration for feature refinement. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that can extract and refine multi-scale features and predict an importance score for each frame. The proposed network comprises four main modules. First, a dense prediction transformer with a ViT backbone is applied, for the first time in this domain, to acquire the optimal representations from the input frames. Then, feature maps are obtained from various layers separately and processed individually to support multi-scale progressive feature fusion and refinement before the data are passed to the final prediction module. Next, a customized pyramidal refinement block is employed to refine the multi-level feature set before predicting the importance scores. Finally, video summaries are produced by selecting keyframes based on the predictions. To evaluate the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, where it achieves F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
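The last two steps of the abstract (turning per-frame importance scores into a summary, then scoring that summary with F1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the selection rule, so the top-scoring rule, the `budget` fraction, the function names, and the sample scores below are all assumptions made for illustration.

```python
from typing import List, Set


def select_keyframes(scores: List[float], budget: float = 0.15) -> Set[int]:
    """Return indices of the highest-scoring frames.

    Assumption: the summary keeps a fixed fraction (`budget`) of the frames.
    """
    k = max(1, int(len(scores) * budget))
    # Sort by descending score; ties are broken by the earlier frame index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


def f1_score(predicted: Set[int], reference: Set[int]) -> float:
    """F1 between a predicted and a reference set of frame indices."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical importance scores for a 10-frame video.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4, 0.6, 0.15]
summary = select_keyframes(scores, budget=0.3)  # keeps 3 of 10 frames
print(sorted(summary))
print(f1_score(summary, {1, 3, 8}))  # {1, 3, 8} is a made-up reference summary
```

F1 is the harmonic mean of precision and recall, so a summary scores well only if it both covers the reference frames and avoids selecting frames outside them.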
Original language: English
Article number: 121288
Pages (from-to): 1-14
Journal: Expert Systems with Applications
Volume: 237
Early online date: 29 Aug 2023
Publication status: Published - 1 Mar 2024

Keywords

  • Video summarization
  • Supervised Learning
  • Keyframes
  • Feature fusion
  • Keyshots
  • Feature refinement
