This paper presents an approach for recognising activities in video from an egocentric (first-person view) setup. Our approach infers activity from the interactions of objects and hands. In contrast to previous approaches to activity recognition, we do not require intermediate steps such as object detection or pose estimation. Recently, it has been shown that modelling the spatial distribution of visual words corresponding to local features further improves the performance of activity recognition with the bag-of-visual-words representation. Inspired by this philosophy, our method is based on global spatio-temporal relationships between visual words. We capture the interactions between visual words by encoding their spatial distances, orientations and alignments. These interactions are encoded in a histogram that we name the Histogram of Oriented Pairwise Relations (HOPR). The proposed approach is robust to occlusion and background variation and is evaluated on two challenging egocentric activity datasets consisting of manipulative tasks. We introduce a novel representation of activities based on interactions of local features and experimentally demonstrate its superior performance in comparison to standard activity representations such as bag-of-visual-words.
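The core idea of encoding pairwise relations between visual words (spatial distance and orientation between pairs, binned into a joint histogram) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the bin counts, normalisation, and parameter names (`n_dist_bins`, `n_angle_bins`, `max_dist`) are assumptions chosen for clarity.

```python
import numpy as np

def pairwise_relation_histogram(points, words, n_words,
                                n_dist_bins=4, n_angle_bins=8, max_dist=1.0):
    """Sketch of a HOPR-style descriptor: for every ordered pair of local
    features, bin the pair's word labels, inter-point distance, and the
    orientation of the displacement vector into one joint histogram.

    points  : (N, 2) float array of feature locations
    words   : length-N sequence of visual-word indices in [0, n_words)
    returns : flattened, L1-normalised histogram
    """
    hist = np.zeros((n_words, n_words, n_dist_bins, n_angle_bins))
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = points[j] - points[i]
            # distance bin (clamped to the last bin beyond max_dist)
            dist = np.linalg.norm(d)
            db = min(int(dist / max_dist * n_dist_bins), n_dist_bins - 1)
            # orientation bin over [0, 2*pi)
            ang = np.arctan2(d[1], d[0]) % (2 * np.pi)
            ab = min(int(ang / (2 * np.pi) * n_angle_bins), n_angle_bins - 1)
            hist[words[i], words[j], db, ab] += 1
    total = hist.sum()
    return hist.ravel() / total if total > 0 else hist.ravel()
```

In this sketch the resulting vector can be fed to any standard classifier (e.g. an SVM), exactly as a bag-of-visual-words histogram would be; the difference is that it additionally records how pairs of words are arranged relative to one another.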
|Published - 5 Jan 2014
|9th International Conference on Computer Vision Theory and Applications (VISAPP) - Lisbon, Portugal
Duration: 5 Jan 2014 → 8 Jan 2014