TY - JOUR
T1 - Affective Image Captioning for Visual Artworks Using Emotion-Based Cross-Attention Mechanisms
AU - Ishikawa, Shintaro
AU - Sugiura, Komei
N1 - Funding Information:
This work was supported in part by JSPS KAKENHI under Grant 20H04269, in part by JST CREST, and in part by NEDO.
Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
AB - Within the museum community, the automatic generation of artwork descriptions is expected to accelerate improvements in accessibility for visually impaired visitors. Captions that describe artworks should be based on emotions because art is inseparable from viewers' emotional reactions. However, artworks typically do not have unique interpretations; thus, it is difficult for systems to precisely reflect specified emotions in captions. Most existing methods attempt to leverage emotion labels predicted from images to generate emotion-oriented captions; however, they do not allow users to specify arbitrary emotions. In this paper, we aim to build a model that generates emotion-conditioned captions describing visual art. We propose an affective visual encoder, which integrates emotion attributes and cross-modal joint features of images into visual information across all encoder blocks. Moreover, we introduce affective tokens that fuse grid- and region-based image features to cover both contextual and object-level information. We validated our method on the ArtEmis dataset, and the results demonstrated that it outperformed the baseline methods on all metrics in the emotion-conditioned task.
KW - affective image captioning
KW - emotion
KW - visual art
UR - http://www.scopus.com/inward/record.url?scp=85149882667&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85149882667&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3255887
DO - 10.1109/ACCESS.2023.3255887
M3 - Article
AN - SCOPUS:85149882667
SN - 2169-3536
VL - 11
SP - 24527
EP - 24534
JO - IEEE Access
JF - IEEE Access
ER -