TY - JOUR
T1 - Deep multi-scale pyramidal features network for supervised video summarization
AU - Khan, Habib
AU - Hussain, T.
AU - Ullah Khan, Samee
AU - Ahmad Khan, Zulfiqar
AU - Baik, Sung Wook
N1 - Publisher Copyright: © 2023 Elsevier Ltd
PY - 2024/3/1
Y1 - 2024/3/1
AB - Video data are witnessing exponential growth, and extracting summarized information is challenging. Reducing the load of video traffic is necessary to meet the requirements of efficient video storage, transmission, and retrieval. The aim of video summarization (VS) is to extract the most important contents from video repositories effectively. Recent attempts have used fewer representative features, which have been fed to recurrent networks to achieve VS. However, generating the desired summaries can become challenging due to the limited representativeness of extracted features and a lack of consideration for feature refinement. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that can extract and refine multi-scale features and can predict an importance score for each frame. The proposed network comprises four main modules. First, a dense prediction transformer with a ViT backbone is applied for the first time in this domain to acquire the optimal representations from the input frames. Then, feature maps are obtained from various layers separately and processed individually to support multi-scale progressive feature fusion and refinement before the data are passed to the ultimate prediction module. Next, a customized pyramidal refinement block is employed to refine the multi-level feature set before predicting the importance scores. Finally, video summaries are produced by selecting keyframes based on the predictions. To explore the performance of the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, and our network is found to achieve F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
KW - Video summarization
KW - Keyframes
KW - Feature fusion
KW - Supervised learning
KW - Keyshots
KW - Feature refinement
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-85171612054&partnerID=MN8TOARS
UR - http://www.scopus.com/inward/record.url?scp=85171612054&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171612054&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/3b1f82db-e166-3888-8bfb-47268e584260/
U2 - 10.1016/j.eswa.2023.121288
DO - 10.1016/j.eswa.2023.121288
M3 - Article (journal)
SN - 0957-4174
VL - 237
SP - 1
EP - 14
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 121288
ER -