TY - JOUR
T1 - Target-Dependent UNITER
T2 - A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots
AU - Ishikawa, Shintaro
AU - Sugiura, Komei
N1 - Funding Information:
Manuscript received February 24, 2021; accepted July 20, 2021. Date of publication August 30, 2021; date of current version September 9, 2021. This work was supported in part by JSPS KAKENHI under Grant 20H04269, in part by JST Moonshot R&D under Grant JPMJMS2011, and in part by NEDO. (Corresponding author: Shintaro Ishikawa.) The authors are with Keio University, Kanagawa 223-8522, Japan (e-mail: shin.0116@keio.jp; komei.sugiura@keio.jp). Digital Object Identifier 10.1109/LRA.2021.3108500
Publisher Copyright:
© 2021 IEEE.
PY - 2021/10
Y1 - 2021/10
N2 - Currently, domestic service robots have an insufficient ability to interact naturally through language. This is because understanding human instructions is complicated by various ambiguities. In existing methods, the referring expressions that specify the relationships between objects were insufficiently modeled. In this letter, we propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image, rather than the whole image. Our method is an extension of the UNITER [1]-based Transformer that can be pretrained on general-purpose datasets. We extend the UNITER approach by introducing a new architecture for handling candidate objects. Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
AB - Currently, domestic service robots have an insufficient ability to interact naturally through language. This is because understanding human instructions is complicated by various ambiguities. In existing methods, the referring expressions that specify the relationships between objects were insufficiently modeled. In this letter, we propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image, rather than the whole image. Our method is an extension of the UNITER [1]-based Transformer that can be pretrained on general-purpose datasets. We extend the UNITER approach by introducing a new architecture for handling candidate objects. Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
KW - Deep learning methods
KW - deep learning for visual perception
UR - http://www.scopus.com/inward/record.url?scp=85114709194&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114709194&partnerID=8YFLogxK
U2 - 10.1109/LRA.2021.3108500
DO - 10.1109/LRA.2021.3108500
M3 - Article
AN - SCOPUS:85114709194
SN - 2377-3766
VL - 6
SP - 8401
EP - 8408
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 4
M1 - 9525205
ER -