TY - GEN
T1 - Improving Goal-Oriented Visual Dialogue by Asking Fewer Questions
AU - Kanazawa, Soma
AU - Matsumori, Shoya
AU - Imai, Michita
N1 - Funding Information:
Acknowledgments. This work was supported by JSPS KAKENHI Grant JP21J13789 and JST CREST Grant Number JPMJCR19A1, Japan.
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - An agent that adaptively asks the user questions to seek information is a crucial element in designing a real-world artificial intelligence agent. In particular, an agent for goal-oriented visual dialogue, a task in which an object of interest is located among a group of visually presented objects by asking verbal questions, must be able to efficiently narrow down and identify objects through question generation. Several models based on GuessWhat?! and CLEVR Ask have been published, most of which leverage reinforcement learning to maximize the task success rate. However, existing models follow a policy of asking questions up to a predefined limit, which results in redundant questions. Moreover, the generated questions often refer to only a limited number of objects, which prevents efficient narrowing down and the identification of a wide range of attributes. This paper proposes the Two-Stream Splitter (TSS) for redundant question reduction and efficient question generation. TSS applies a self-attention structure to the image features and location features of objects, combining the information content of both to narrow down candidate objects efficiently. Experimental results on the CLEVR Ask dataset show that the proposed method reduces redundant questions and enables more efficient interaction than previous models.
AB - An agent that adaptively asks the user questions to seek information is a crucial element in designing a real-world artificial intelligence agent. In particular, an agent for goal-oriented visual dialogue, a task in which an object of interest is located among a group of visually presented objects by asking verbal questions, must be able to efficiently narrow down and identify objects through question generation. Several models based on GuessWhat?! and CLEVR Ask have been published, most of which leverage reinforcement learning to maximize the task success rate. However, existing models follow a policy of asking questions up to a predefined limit, which results in redundant questions. Moreover, the generated questions often refer to only a limited number of objects, which prevents efficient narrowing down and the identification of a wide range of attributes. This paper proposes the Two-Stream Splitter (TSS) for redundant question reduction and efficient question generation. TSS applies a self-attention structure to the image features and location features of objects, combining the information content of both to narrow down candidate objects efficiently. Experimental results on the CLEVR Ask dataset show that the proposed method reduces redundant questions and enables more efficient interaction than previous models.
KW - Attention mechanism
KW - CLEVR Ask
KW - Goal-oriented visual dialogue
KW - GuessWhat?!
KW - Visual state estimation
UR - http://www.scopus.com/inward/record.url?scp=85121919161&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121919161&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-92270-2_14
DO - 10.1007/978-3-030-92270-2_14
M3 - Conference contribution
AN - SCOPUS:85121919161
SN - 9783030922696
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 158
EP - 169
BT - Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
A2 - Mantoro, Teddy
A2 - Lee, Minho
A2 - Ayu, Media Anugerah
A2 - Wong, Kok Wai
A2 - Hidayanto, Achmad Nizar
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Conference on Neural Information Processing, ICONIP 2021
Y2 - 8 December 2021 through 12 December 2021
ER -