Abstract
In this paper, we propose a new method for audio-visual event localization 1), i.e., finding the segments in which an audio event and the corresponding visual event occur. Previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, but recurrent neural networks such as LSTM cannot precisely learn long-term features. Inspired by the success of attention modules in capturing long-term dependencies, we propose a Temporal Cross-Modal Attention (TCMA) module, which incorporates self-attention and extracts temporal features from the two modalities more precisely. With TCMA, we localize audio-visual events precisely and achieve higher accuracy than previous works.
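To illustrate the general idea, the sketch below combines per-modality self-attention with cross-modal attention over segment-level features. It is a minimal PyTorch sketch under assumed shapes and layer choices; the class name, feature dimension, and layer layout are illustrative and not the paper's exact TCMA implementation.

```python
# Illustrative sketch of temporal cross-modal attention
# (assumed design, not the authors' exact TCMA module).
import torch
import torch.nn as nn


class TemporalCrossModalAttention(nn.Module):
    """Intra-modal self-attention followed by cross-modal attention.

    Audio and visual features are (batch, time, dim), one feature vector
    per segment, as is common in audio-visual event localization.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Self-attention captures long-term temporal dependencies per modality.
        self.audio_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention lets each modality query the other over time.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Intra-modal self-attention with residual connections.
        a, _ = self.audio_self(audio, audio, audio)
        v, _ = self.visual_self(visual, visual, visual)
        a = self.norm_a(audio + a)
        v = self.norm_v(visual + v)
        # Cross-modal attention: audio queries visual features and vice versa.
        a2v, _ = self.audio_to_visual(a, v, v)
        v2a, _ = self.visual_to_audio(v, a, a)
        return a + a2v, v + v2a


if __name__ == "__main__":
    B, T, D = 2, 10, 256  # batch, number of segments, feature dimension (assumed)
    audio_feats = torch.randn(B, T, D)
    visual_feats = torch.randn(B, T, D)
    tcma = TemporalCrossModalAttention(dim=D)
    fused_audio, fused_visual = tcma(audio_feats, visual_feats)
    print(fused_audio.shape, fused_visual.shape)  # both torch.Size([2, 10, 256])
```

The fused features could then be fed to a segment-level classifier to decide, for each segment, whether an audio-visual event is present.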
| Original language | English |
|---|---|
| Pages (from-to) | 263-268 |
| Number of pages | 6 |
| Journal | Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering |
| Volume | 88 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - 2022 |
Keywords
- audio-visual
- event localization
- multi-modal
- self-attention
ASJC Scopus subject areas
- Mechanical Engineering