Abstract
Video summarization (VS) condenses high-dimensional (HD) video data by extracting only the important information. However, prior research has not addressed the need for surveillance VS, which assists video surveillance experts in many applications, including video retrieval and data storage. In addition, mainstream techniques commonly use two-dimensional (2-D) deep models for VS, ignoring event occurrences. Accordingly, we present a two-fold 3-D deep-learning-assisted VS framework. First, we employ an inflated 3-D ConvNet (I3D) model to extract temporal features; these features are optimized using a proposed encoder mechanism. The input video is temporally segmented using a feature comparison technique that selects a single frame from each video segment. The segmented shots are evaluated using our novel shot segmentation evaluation scheme and, in the second fold, are input into a saliency computation mechanism for keyframe selection. Qualitative and quantitative analyses over VS benchmarks and surveillance videos demonstrate the superior performance of our framework, with 0.3- and 4.2-unit increases in the F1 scores on the YouTube and title-based video summarization datasets, respectively. Along with accurate VS, a key contribution of our study is the novel shot segmentation criterion applied prior to VS, which can serve as a benchmark in future research to effectively prioritize HD visual data.
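The temporal segmentation step described above can be illustrated with a minimal sketch. The abstract does not specify the feature comparison measure, so the snippet below assumes cosine similarity between consecutive clip-level feature vectors (such as I3D embeddings) and a hypothetical similarity threshold; a shot boundary is declared when similarity drops below it, and one representative frame index is selected per shot. Both function names and the threshold value are illustrative assumptions, not the paper's exact method.

```python
import numpy as np


def segment_shots(features: np.ndarray, threshold: float = 0.85):
    """Split a sequence of clip-level feature vectors into shots.

    A new shot starts whenever the cosine similarity between
    consecutive feature vectors falls below `threshold` (an
    assumed boundary criterion). Returns (start, end) index
    pairs with `end` exclusive.
    """
    shots, start = [], 0
    for i in range(1, len(features)):
        a, b = features[i - 1], features[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(features)))
    return shots


def pick_representatives(features: np.ndarray, shots):
    """Select one index per shot: the feature closest to the shot mean."""
    picks = []
    for s, e in shots:
        seg = features[s:e]
        centroid = seg.mean(axis=0)
        picks.append(s + int(np.argmin(np.linalg.norm(seg - centroid, axis=1))))
    return picks
```

For example, a sequence whose features abruptly rotate from one direction to another would be split into two shots at the rotation point, with one keyframe candidate chosen from each.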
| Original language | English |
| --- | --- |
| Pages (from-to) | 7946-7956 |
| Journal | IEEE Transactions on Industrial Informatics |
| Volume | 19 |
| Issue number | 7 |
| Early online date | 18 Oct 2022 |
| DOIs | |
| Publication status | Published - 31 Jul 2023 |
Keywords
- Data prioritization
- I3D features
- saliency extraction
- video analytics
- video sensing
- video summarization