Performance Evaluation of PEACH3: Field-Programmable Gate Array Switch for Tightly Coupled Accelerators

Takahiro Kaneda, Ryotaro Sakai, Naoki Nishikawa, Toshihiro Hanawa, Chiharu Tsuruta, Hideharu Amano

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

PEACH3 (PCI-Express Adaptive Communication Hub ver. 3), an FPGA switching hub for the tightly coupled accelerators (TCA) architecture, is evaluated and its communication speed is analyzed. PEACH3 connects a number of GPUs directly through PCI Express Gen3 x8 ports. The latency of inter-node GPU-GPU communication with PEACH3 was about 2.8 µs, one third of that of the CUDA API with MPI/InfiniBand. The bandwidth was about 1.21 times that of the previous version, PEACH2, and 1.54 times that of MPI/InfiniBand for a 512 KB data transfer. Two application programs, BFS (breadth-first search) and CG (conjugate gradient), were implemented with the TCA API and with the CUDA API with MPI/InfiniBand. The performance of BFS with PEACH3 was 1.16 times better than that with PEACH2, and 1.3 times better than that with MPI/InfiniBand, for a graph with scale = 15. In CG, for the small matrix (CLASS=S), PEACH3 achieved 12% better performance than PEACH2 and 25% better performance than MPI/InfiniBand. However, since the bandwidth of PEACH3 with PCI Express Gen3 x8 is smaller than that of InfiniBand with PCI Express Gen3 x16, the performance benefit disappeared for the CLASS=A matrix. The evaluation suggests that when the data size is small, using the TCA API with PEACH3 is advantageous even for intra-node communication.
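
To make the Gen3 x8 versus Gen3 x16 comparison in the abstract concrete, the following minimal sketch computes the theoretical per-direction peak bandwidth of each link width. It assumes only the standard PCI Express Gen3 signaling parameters (8 GT/s per lane, 128b/130b encoding); the function and constant names are illustrative and do not come from the paper. The results are upper bounds before packet and protocol overhead, not the measured bandwidths reported by the authors, and they describe the host slot only, not the InfiniBand link rate.

```python
# Theoretical PCI Express Gen3 peak bandwidth per direction.
# Assumptions (standard Gen3 parameters, not taken from the paper):
#   8 GT/s per lane, 128b/130b line encoding.
GEN3_TRANSFERS_PER_SEC = 8e9
GEN3_ENCODING_EFFICIENCY = 128 / 130


def pcie_gen3_peak_gbytes(lanes: int) -> float:
    """Return the peak one-way bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bits_per_sec = GEN3_TRANSFERS_PER_SEC * GEN3_ENCODING_EFFICIENCY * lanes
    return bits_per_sec / 8 / 1e9


if __name__ == "__main__":
    for lanes in (8, 16):
        print(f"Gen3 x{lanes}: {pcie_gen3_peak_gbytes(lanes):.2f} GB/s")
```

Under these assumptions the x8 ceiling is about 7.88 GB/s and the x16 ceiling about 15.75 GB/s, which is consistent with the abstract's observation that the low-latency advantage of PEACH3 matters for small transfers while the wider link favors larger ones.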

Original language: English
Title of host publication: Proceedings of the 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450353168
DOI: https://doi.org/10.1145/3120895.3120911
Publication status: Published - 2017 Jun 7
Event: 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017 - Bochum, Germany
Duration: 2017 Jun 7 to 2017 Jun 9

Other

Other: 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017
Country: Germany
City: Bochum
Period: 17/6/7 to 17/6/9

Fingerprint

Particle accelerators
Field programmable gate arrays (FPGA)
Switches
Communication
Application programming interfaces (API)
Bandwidth
Data transfer
Application programs
Graphics processing unit

Keywords

  • Cluster
  • GPU
  • PEACH3
  • TCA

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Kaneda, T., Sakai, R., Nishikawa, N., Hanawa, T., Tsuruta, C., & Amano, H. (2017). Performance Evaluation of PEACH3: Field-Programmable Gate Array Switch for Tightly Coupled Accelerators. In Proceedings of the 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017 [9] Association for Computing Machinery. https://doi.org/10.1145/3120895.3120911

@inproceedings{383e092f0ec9475dbf420f28c07de82f,
title = "Performance Evaluation of PEACH3: Field-Programmable Gate Array Switch for Tightly Coupled Accelerators",
abstract = "PEACH3 (PCI-Express Adaptive Communication Hub ver. 3), an FPGA switching hub for the tightly coupled accelerators (TCA) architecture, is evaluated and its communication speed is analyzed. PEACH3 connects a number of GPUs directly through PCI Express Gen3 x8 ports. The latency of inter-node GPU-GPU communication with PEACH3 was about 2.8 µs, one third of that of the CUDA API with MPI/InfiniBand. The bandwidth was about 1.21 times that of the previous version, PEACH2, and 1.54 times that of MPI/InfiniBand for a 512 KB data transfer. Two application programs, BFS (breadth-first search) and CG (conjugate gradient), were implemented with the TCA API and with the CUDA API with MPI/InfiniBand. The performance of BFS with PEACH3 was 1.16 times better than that with PEACH2, and 1.3 times better than that with MPI/InfiniBand, for a graph with scale = 15. In CG, for the small matrix (CLASS=S), PEACH3 achieved 12{\%} better performance than PEACH2 and 25{\%} better performance than MPI/InfiniBand. However, since the bandwidth of PEACH3 with PCI Express Gen3 x8 is smaller than that of InfiniBand with PCI Express Gen3 x16, the performance benefit disappeared for the CLASS=A matrix. The evaluation suggests that when the data size is small, using the TCA API with PEACH3 is advantageous even for intra-node communication.",
keywords = "Cluster, GPU, PEACH3, TCA",
author = "Takahiro Kaneda and Ryotaro Sakai and Naoki Nishikawa and Toshihiro Hanawa and Chiharu Tsuruta and Hideharu Amano",
year = "2017",
month = "6",
day = "7",
doi = "10.1145/3120895.3120911",
language = "English",
booktitle = "Proceedings of the 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Performance Evaluation of PEACH3

T2 - Field-Programmable Gate Array Switch for Tightly Coupled Accelerators

AU - Kaneda, Takahiro

AU - Sakai, Ryotaro

AU - Nishikawa, Naoki

AU - Hanawa, Toshihiro

AU - Tsuruta, Chiharu

AU - Amano, Hideharu

PY - 2017/6/7

Y1 - 2017/6/7

AB - PEACH3 (PCI-Express Adaptive Communication Hub ver. 3), an FPGA switching hub for the tightly coupled accelerators (TCA) architecture, is evaluated and its communication speed is analyzed. PEACH3 connects a number of GPUs directly through PCI Express Gen3 x8 ports. The latency of inter-node GPU-GPU communication with PEACH3 was about 2.8 µs, one third of that of the CUDA API with MPI/InfiniBand. The bandwidth was about 1.21 times that of the previous version, PEACH2, and 1.54 times that of MPI/InfiniBand for a 512 KB data transfer. Two application programs, BFS (breadth-first search) and CG (conjugate gradient), were implemented with the TCA API and with the CUDA API with MPI/InfiniBand. The performance of BFS with PEACH3 was 1.16 times better than that with PEACH2, and 1.3 times better than that with MPI/InfiniBand, for a graph with scale = 15. In CG, for the small matrix (CLASS=S), PEACH3 achieved 12% better performance than PEACH2 and 25% better performance than MPI/InfiniBand. However, since the bandwidth of PEACH3 with PCI Express Gen3 x8 is smaller than that of InfiniBand with PCI Express Gen3 x16, the performance benefit disappeared for the CLASS=A matrix. The evaluation suggests that when the data size is small, using the TCA API with PEACH3 is advantageous even for intra-node communication.

KW - Cluster

KW - GPU

KW - PEACH3

KW - TCA

UR - http://www.scopus.com/inward/record.url?scp=85040669115&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040669115&partnerID=8YFLogxK

U2 - 10.1145/3120895.3120911

DO - 10.1145/3120895.3120911

M3 - Conference contribution

AN - SCOPUS:85040669115

BT - Proceedings of the 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017

PB - Association for Computing Machinery

ER -