Recently, the demand for autonomous agricultural robots has been increasing. Using vision-based environmental recognition, such robots can follow farmers and support their work, for example by conveying the harvest. However, a major issue is that dynamic objects (including humans) often enter the images on which the robots rely for environmental recognition. These dynamic objects considerably degrade image-recognition performance, which can cause the robots to collide with crops or ridges while following a worker. To address this occlusion issue, generative adversarial networks (GANs) can be adopted, as their generative capability allows them to reconstruct the area behind dynamic objects. However, previous GAN-based methods generally presuppose paired image datasets for training, which are difficult to prepare. A method based on unpaired image datasets is therefore desirable in real-world environments such as a farm. For this purpose, we propose a new approach that integrates two state-of-the-art neural network architectures, CycleGAN and Mask R-CNN. Our system is trained on a human-tracking dataset collected by an autonomous agricultural robot on a farm. We evaluate its performance both qualitatively and quantitatively on the task of removing humans from images.
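The mask-then-inpaint pipeline implied above can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' code: the instance-segmentation stage (Mask R-CNN in the paper) is assumed to yield a binary human mask, and the CycleGAN generator is replaced by a trivial mean-background fill purely to show how the mask, the generated content, and the original image are composited.

```python
import numpy as np

def mean_background_fill(masked_image, mask):
    """Stand-in for the CycleGAN generator: fill with the mean background color.

    The real generator would synthesize plausible scene content behind
    the human; this stub only demonstrates the compositing interface.
    """
    bg_pixels = masked_image[~mask]          # (N, 3) pixels outside the mask
    fill = bg_pixels.mean(axis=0)            # average background color
    return np.broadcast_to(fill, masked_image.shape)

def remove_human(image, human_mask, generator=mean_background_fill):
    """image: (H, W, 3) floats in [0, 1]; human_mask: (H, W) bool, True on the human."""
    masked = image * ~human_mask[..., None]  # blank out the human region
    fake = generator(masked, human_mask)     # inpaint the blanked region
    # composite: generated pixels inside the mask, original pixels outside
    return np.where(human_mask[..., None], fake, image)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                    # pretend a detected person occupies this box
out = remove_human(image, mask)
```

Note that pixels outside the mask are passed through unchanged, so only the occluded region depends on the generator's output.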