TY - GEN
T1 - Parallel Implementation of CNN on Multi-FPGA Cluster
AU - Fukushima, Yasuyu
AU - Iizuka, Kensuke
AU - Amano, Hideharu
N1 - Funding Information:
This work was supported by JST CREST, Grant Number JPMJCR19K1, Japan.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - We developed a PYNQ cluster called M-KUBOS, consisting of economical Zynq boards interconnected through low-cost, high-performance GTH serial links. For the software environment, we employed the PYNQ open-source software platform. The PYNQ cluster is anticipated to serve as a multi-access edge computing (MEC) server for 5G mobile networks. We implemented a ResNet-50 inference accelerator on the PYNQ cluster for image recognition in MEC applications. By estimating the execution time of each ResNet-50 layer, we distributed the layers across four boards so that the execution time of each board would be as equal as possible for efficient pipeline processing. Because the FPGAs in the PYNQ cluster are directly connected by high-speed serial links, stream processing without network bottlenecks and pipeline processing between boards were readily realized. The implementation achieved 292 GOPS performance, 75.1 FPS throughput, and 5.15 GOPS/W power efficiency. It was 17 times faster and 86 times more power-efficient than the CPU implementation, and 3.8 times more power-efficient than the GPU implementation.
KW - CNN
KW - MEC
KW - Multi-FPGA
UR - http://www.scopus.com/inward/record.url?scp=85126660156&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126660156&partnerID=8YFLogxK
U2 - 10.1109/MCSoC51149.2021.00019
DO - 10.1109/MCSoC51149.2021.00019
M3 - Conference contribution
AN - SCOPUS:85126660156
T3 - Proceedings - 2021 IEEE 14th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2021
SP - 77
EP - 83
BT - Proceedings - 2021 IEEE 14th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2021
Y2 - 20 December 2021 through 23 December 2021
ER -