CNNs have traditionally been applied in computer vision. Recently, applying Transformer networks, originally developed for natural language processing, to computer vision has received much attention and produced superior results. However, Transformers and their derivatives have the drawback that computational cost and memory usage grow rapidly with image resolution. In this paper, we propose the Laplacian Pyramid Translation Transformer (LPTT) for image-to-image translation. The Laplacian Pyramid Translation Network, the predecessor of this work, builds a Laplacian pyramid of the input images and processes every component with CNNs. In contrast, LPTT transforms the high-frequency components with CNNs and the low-frequency component with Axial Transformer blocks, retaining the Transformer's expressive power while reducing computational cost and memory usage. LPTT significantly improves both the quality of generated images and the inference speed on high-resolution images over conventional methods. LPTT is the first Transformer-based network that can perform practical real-time inference on 4K images, and it can also process 8K images in real time depending on the model configuration and GPU performance. The ablation study in this paper suggests that, even when processing high-resolution images, computing the low-resolution component with a Transformer improves performance while maintaining inference speed. LPTT improves PSNR by 0.41 dB on the MIT-Adobe FiveK dataset, and the more layers the Laplacian pyramid has, the greater the improvement of LPTT over the Laplacian Pyramid Translation Network.
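The decomposition described above can be illustrated with a minimal sketch. This is not the paper's implementation: a real Laplacian pyramid uses Gaussian filtering before subsampling, whereas this sketch substitutes simple average-pooling and nearest-neighbour upsampling so that it stays self-contained; the function names are hypothetical.

```python
import numpy as np

def downsample(x):
    # 2x average-pool downsampling (a stand-in for Gaussian blur + subsample)
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x, shape):
    # nearest-neighbour upsampling, cropped back to `shape`
    up = x.repeat(2, axis=0).repeat(2, axis=1)
    return up[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels):
    """Split `img` into `levels` high-frequency bands plus one low-frequency residual.

    In LPTT, the high-frequency bands would go to CNN branches and the
    low-frequency residual to Axial Transformer blocks.
    """
    pyramid, current = [], img
    for _ in range(levels):
        low = downsample(current)
        pyramid.append(current - upsample(low, current.shape))  # high-frequency band
        current = low
    pyramid.append(current)  # low-frequency residual
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding the bands back."""
    img = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        img = band + upsample(img, band.shape)
    return img
```

Because each band stores exactly what the downsample/upsample round trip loses, the reconstruction is lossless, which is why a translation network can edit the individual components and still produce a coherent full-resolution output.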
ASJC Scopus subject areas
- Computer Science (General)