Temporal Cross-Modal Attention for Audio-Visual Event Localization

Yoshiki Nagasaki, Masaki Hayashi, Naoshi Kaneko, Yoshimitsu Aoki

Research output: Article, peer-reviewed

Abstract

In this paper, we propose a new method for audio-visual event localization 1), the task of finding the corresponding segments between audio and visual events. Previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, but recurrent neural networks such as LSTM cannot precisely learn long-term features. We therefore propose a Temporal Cross-Modal Attention (TCMA) module, which extracts temporal features from the two modalities more precisely. Inspired by the success of attention modules in capturing long-term features, TCMA incorporates self-attention. As a result, we localize audio-visual events precisely and achieve higher accuracy than previous works.
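
The abstract does not spell out the internals of TCMA, so the following is only a minimal sketch of generic scaled dot-product attention applied across two modalities over time, not the authors' module. The function name `cross_modal_attention`, the segment count `T`, the feature size `d`, and the random inputs are all assumptions made for illustration; learned projection layers are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, context):
    """Attend from each query segment to every context segment.

    queries, context: lists of T feature vectors (each of length d),
    one per temporal segment. Returns (attended, weights), where
    weights[t] is a distribution over the context segments.
    Hypothetical helper, not taken from the paper.
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    attended, weights = [], []
    for q in queries:
        # Similarity of this query segment to every context segment.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in context]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of context features across all time steps.
        attended.append(
            [sum(w_t * c[j] for w_t, c in zip(w, context)) for j in range(d)]
        )
    return attended, weights


# Placeholder inputs: T segments with d-dimensional features per modality.
random.seed(0)
T, d = 10, 8
audio = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

# Audio queries attend over visual segments, and vice versa.
audio_ctx, w_av = cross_modal_attention(audio, visual)
visual_ctx, w_va = cross_modal_attention(visual, audio)
```

Because every segment attends directly to every other segment, the distance between two time steps does not affect how easily they interact, which is the property the abstract contrasts with LSTM-based temporal modeling.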

Original language: English
Pages (from-to): 263-268
Number of pages: 6
Journal: Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering
Volume: 88
Issue number: 3
DOI
Publication status: Published - 2022

ASJC Scopus subject areas

  • Mechanical Engineering
