To create dialogue systems that provide the information a user needs at an opportune moment, it is important to infer the user's mental states, such as their beliefs and desires. Existing studies on inferring beliefs and desires fall into two types: one infers them from actions, and the other infers them from the content of utterances. However, a method that integrates both sources of evidence to infer beliefs and desires has not yet been established. In this paper, we propose Multimodal Inference of Mind Simultaneous Contextualization and Interpreting (MIoM SCAIN), a system for sequentially inferring users' beliefs and desires on the basis of their walking behaviors and the content of their utterances. In our evaluation, we compared the inferences of MIoM SCAIN with those of baselines that use only walking behaviors or only the content of utterances. MIoM SCAIN's predictions correlated more strongly with subjective judgements than those of the baselines, indicating that beliefs and desires can be inferred from walking behaviors and utterance content in combination.