This paper presents a method for learning novel objects from audio-visual input. Objects are learned using out-of-vocabulary word segmentation and object extraction. The latter half of this paper is devoted to evaluation. We propose the use of a task adopted from the RoboCup@Home league as a standard evaluation for real-world applications. We have implemented the proposed method on a real humanoid robot and evaluated it through a task called "Supermarket". The results show that our integrated system works well in a real application. In fact, our robot's score exceeded the maximum score obtained in the RoboCup@Home 2009 competitions.