TY - JOUR
T1 - A Multi-Stream Sequence Learning Framework for Human Interaction Recognition
AU - Haroon, Umair
AU - Ullah, Amin
AU - Hussain, Tanveer
AU - Ullah, Waseem
AU - Sajjad, Muhammad
AU - Muhammad, Khan
AU - Lee, Mi Young
AU - Baik, Sung Wook
PY - 2022
Y1 - 2022/1/27
AB - Human interaction recognition (HIR) is challenging because multiple humans and their mutual interactions appear within a single frame, generated from their movements. Mainstream approaches rely on three-dimensional (3-D) convolutional neural networks (CNNs) that process only visual frames, whereas human joint data play a vital role in accurate interaction recognition. Therefore, this article proposes a multistream network for HIR that learns from both skeleton key points and spatiotemporal visual representations. The first stream localizes the joints of the human body using a pose estimation model and passes them to a 1-D CNN and a bidirectional long short-term memory network to efficiently extract features of the dynamic movements of each human skeleton. The second stream feeds the sequence of visual frames to a 3-D CNN to extract discriminative spatiotemporal features. Finally, the outputs of both streams are fused via fully connected layers that classify the ongoing interactions between humans. To validate the performance of the proposed network, we conducted a comprehensive set of experiments on two benchmark datasets, UT-Interaction and TV Human Interaction, and observed accuracy improvements of 1.15% and 10.0%, respectively.
UR - https://doi.org/10.1109/THMS.2021.3138708
DO - 10.1109/THMS.2021.3138708
M3 - Article (journal)
SN - 2168-2291
VL - 52
SP - 435
EP - 444
JO - IEEE Transactions on Human-Machine Systems
JF - IEEE Transactions on Human-Machine Systems
IS - 3
ER -