Moment-based Adversarial Training for Embodied Language Comprehension

Shintaro Ishikawa, Komei Sugiura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this paper, we focus on a vision-and-language task in which a robot is instructed to execute household tasks. Given an instruction such as "Rinse off a mug and place it in the coffee maker," the robot is required to locate the mug, wash it, and put it in the coffee maker. This is challenging because the robot needs to break down the instruction sentences into subgoals and execute them in the correct order. On the ALFRED benchmark, the performance of state-of-the-art methods is still far lower than that of humans. This is partially because existing methods sometimes fail to infer subgoals that are not explicitly specified in the instruction sentences. We propose Moment-based Adversarial Training (MAT), which uses two types of moments for perturbation updates in adversarial training. We introduce MAT to the embedding spaces of the instruction, subgoals, and state representations to handle their variety. We validated our method on the ALFRED benchmark, and the results demonstrated that our method outperformed the baseline method on all of the benchmark's metrics.

Original language: English
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 7
ISBN (Electronic): 9781665490627
Publication status: Published - 2022
Event: 26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada
Duration: 2022 Aug 21 → 2022 Aug 25

Publication series

Name: Proceedings - International Conference on Pattern Recognition
ISSN (Print): 1051-4651


Conference: 26th International Conference on Pattern Recognition, ICPR 2022

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

