TY - GEN
T1 - Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition
AU - Wang, Yanan
AU - Wu, Jianming
AU - Heracleous, Panikos
AU - Wada, Shinya
AU - Kimura, Rui
AU - Kurihara, Satoshi
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/10/21
Y1 - 2020/10/21
N2 - Audio-video group emotion recognition is a challenging task, since it is difficult to gather the broad range of potential information needed to obtain meaningful emotional representations. Humans can easily understand emotions because they associate implicit contextual knowledge (stored in memory) with the explicit information they see and hear directly. This paper proposes an end-to-end architecture, the implicit knowledge injectable cross attention audiovisual deep neural network (K-injection audiovisual network), that imitates this intuition. The K-injection audiovisual network is used to train an audiovisual model that not only obtains audiovisual representations of group emotions through an explicit feature-based cross attention audiovisual subnetwork (audiovisual subnetwork), but also absorbs implicit knowledge of emotions through two implicit knowledge-based injection subnetworks (K-injection subnetworks). In addition, the model is trained with both explicit features and implicit knowledge but can make inferences using only explicit features. We define region-of-interest (ROI) visual features and Mel-spectrogram audio features as explicit features, since they are directly present in the raw audio-video data. In contrast, we define linguistic and acoustic emotional representations that do not exist in the audio-video data as implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions and basic acoustic features (MFCCs, pitch, and energy) to the linguistic and acoustic K-injection subnetworks is defined as linguistic and acoustic knowledge, respectively. Compared with the baseline accuracy of 47.88% on the testing set, the audiovisual models trained with the linguistic, acoustic, and linguistic-acoustic K-injection subnetworks achieved an average overall accuracy of 66.40%.
AB - Audio-video group emotion recognition is a challenging task, since it is difficult to gather the broad range of potential information needed to obtain meaningful emotional representations. Humans can easily understand emotions because they associate implicit contextual knowledge (stored in memory) with the explicit information they see and hear directly. This paper proposes an end-to-end architecture, the implicit knowledge injectable cross attention audiovisual deep neural network (K-injection audiovisual network), that imitates this intuition. The K-injection audiovisual network is used to train an audiovisual model that not only obtains audiovisual representations of group emotions through an explicit feature-based cross attention audiovisual subnetwork (audiovisual subnetwork), but also absorbs implicit knowledge of emotions through two implicit knowledge-based injection subnetworks (K-injection subnetworks). In addition, the model is trained with both explicit features and implicit knowledge but can make inferences using only explicit features. We define region-of-interest (ROI) visual features and Mel-spectrogram audio features as explicit features, since they are directly present in the raw audio-video data. In contrast, we define linguistic and acoustic emotional representations that do not exist in the audio-video data as implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions and basic acoustic features (MFCCs, pitch, and energy) to the linguistic and acoustic K-injection subnetworks is defined as linguistic and acoustic knowledge, respectively. Compared with the baseline accuracy of 47.88% on the testing set, the audiovisual models trained with the linguistic, acoustic, and linguistic-acoustic K-injection subnetworks achieved an average overall accuracy of 66.40%.
KW - affective computing
KW - machine learning for multimodal interaction
KW - multimodal fusion and representation
UR - http://www.scopus.com/inward/record.url?scp=85096691083&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096691083&partnerID=8YFLogxK
U2 - 10.1145/3382507.3417960
DO - 10.1145/3382507.3417960
M3 - Conference contribution
AN - SCOPUS:85096691083
T3 - ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction
SP - 827
EP - 834
BT - ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction
PB - Association for Computing Machinery, Inc
T2 - 22nd ACM International Conference on Multimodal Interaction, ICMI 2020
Y2 - 25 October 2020 through 29 October 2020
ER -