Low-reliable low-latency networks optimized for HPC parallel applications

Truong Thao Nguyen, Hiroki Matsutani, Michihiro Koibuchi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

High-end network standards, such as 400GbE, have introduced Forward Error Correction (FEC) to maintain the same bit error rate (BER) as traditional low-bandwidth interconnection networks. However, the latency overhead of the FEC operation surprisingly exceeds the sum of all other switch operation overheads, e.g., routing computation and switch allocation, and it significantly degrades the performance of parallel applications in HPC systems. In this study, we instead explore a low-latency network design that uses a Hamming code, which does not guarantee strictly error-free communication. Because the design is consistent with the existing frame format based on the standard Reed-Solomon RS(544,514) code with DC(64b/66b) direct line code and TC(256b/257b) transcode, its influence on the design of the other network layers is limited. Interestingly, a large number of parallel applications can tolerate the BER of such a Hamming code. Because relaxing the BER requirement in this way reduces switch operation latency, the proposed network using the Hamming code improves the execution time of NAS Parallel Benchmarks by 56% on average compared with counterpart RS-FEC networks.
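
To illustrate why a Hamming code is cheap to decode yet does not guarantee error-free delivery, the following is a minimal Python sketch of the textbook Hamming(7,4) code. The abstract does not state which Hamming code parameters the paper uses, so the (7,4) size and the function names here are assumptions chosen only for brevity; this is not the authors' implementation. The decoder corrects any single flipped bit with a few XORs, but a double-bit error is silently "corrected" into the wrong data word.

```python
# Illustrative textbook Hamming(7,4) code. The code length used in the paper
# is not given in the abstract; (7,4) is an assumption made for this sketch.

def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Return the 4 data bits after correcting at most one flipped bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the suspected error (0 = none).
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def flip(codeword, *positions):
    """Return a copy of codeword with the given 0-based bit positions flipped."""
    c = list(codeword)
    for i in positions:
        c[i] ^= 1
    return c

if __name__ == "__main__":
    data = [1, 0, 1, 1]
    cw = hamming74_encode(data)
    print("single-bit error recovered:", hamming74_decode(flip(cw, 4)) == data)
    print("double-bit error recovered:", hamming74_decode(flip(cw, 1, 4)) == data)
```

In this sketch, every single-bit error is recovered, while every double-bit error yields incorrect data without any indication of failure, which is the kind of residual error an application would have to tolerate.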

Original language: English
Title of host publication: NCA 2018 - 2018 IEEE 17th International Symposium on Network Computing and Applications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538676592
Publication status: Published - 2018 Nov 26
Event: 17th IEEE International Symposium on Network Computing and Applications, NCA 2018 - Cambridge, United States
Duration: 2018 Nov 1 to 2018 Nov 3

Publication series

Name: NCA 2018 - 2018 IEEE 17th International Symposium on Network Computing and Applications

Other

Other: 17th IEEE International Symposium on Network Computing and Applications, NCA 2018
Country/Territory: United States
City: Cambridge
Period: 18/11/1 to 18/11/3

Keywords

  • Forward Error Correction (FEC)
  • High Performance Computing (HPC)
  • Interconnection networks

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Safety, Risk, Reliability and Quality
