An In-Network Parameter Aggregation using DPDK for Multi-GPU Deep Learning

Masaki Furukawa, Tomoya Itsubo, Hiroki Matsutani

Research output: Conference contribution

2 Citations (Scopus)

Abstract

In distributed deep neural network training with remote GPU nodes, the nodes communicate at every iteration to aggregate gradients. This communication latency limits the benefit of distributed training with faster GPUs. In addition, when remote GPUs are used, the gradient aggregation workload falls on the host machine. In this paper, we therefore propose offloading gradient aggregation to a DPDK (Data Plane Development Kit) based network switch placed between the host machine and the remote GPUs. In this approach, aggregation is completed inside the network using extra computation resources in the network switch. We evaluate the proposed switch in two settings: one in which the GPUs and the host communicate over standard IP, and one in which they communicate through a PCI Express (PCIe) over 40 Gbit Ethernet (40GbE) product. With standard IP communication, aggregation is 2.2-2.5x faster than aggregation executed by the host machine. With the PCIe over 40GbE product, the proposed switch outperforms aggregation done by the host machine by 1.16x. This approach is thus useful for distributed training with multiple GPUs.
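The aggregation step that the abstract moves into the switch can be pictured with a small sketch. The Python below is illustrative only and is not the authors' DPDK implementation: the class name `ChunkAggregator`, the per-chunk slot table, and the packet fields `worker_id` and `chunk_id` are all assumptions introduced here. It shows one plausible shape of the logic, in which the switch sums gradient chunks element-wise as they arrive and releases the result once every GPU node has contributed.

```python
from typing import Dict, List, Optional, Set, Tuple


class ChunkAggregator:
    """Toy model of switch-side gradient aggregation (hypothetical design)."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        # chunk_id -> (running element-wise sum, workers seen so far)
        self._slots: Dict[int, Tuple[List[float], Set[int]]] = {}

    def on_packet(self, worker_id: int, chunk_id: int,
                  grads: List[float]) -> Optional[List[float]]:
        """Accumulate one gradient chunk.

        Returns the aggregated chunk once all workers have contributed,
        otherwise None.
        """
        total, seen = self._slots.get(chunk_id, ([0.0] * len(grads), set()))
        if worker_id in seen:
            # Assumed policy: a repeated chunk from the same worker is ignored.
            return None
        total = [t + g for t, g in zip(total, grads)]
        seen = seen | {worker_id}
        if len(seen) == self.num_workers:
            # All contributions received: free the slot and emit the sum.
            self._slots.pop(chunk_id, None)
            return total
        self._slots[chunk_id] = (total, seen)
        return None
```

In this sketch the host would receive a single aggregated chunk per `chunk_id` instead of one chunk per GPU, which is the kind of host-side work the abstract describes offloading.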

Original language: English
Title of host publication: Proceedings - 2020 8th International Symposium on Computing and Networking, CANDAR 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 108-114
Number of pages: 7
ISBN (Electronic): 9781728182216
DOI
Publication status: Published - Nov 2020
Event: 8th International Symposium on Computing and Networking, CANDAR 2020 - Virtual, Naha, Japan
Duration: 24 Nov 2020 to 27 Nov 2020

Publication series

Name: Proceedings - 2020 8th International Symposium on Computing and Networking, CANDAR 2020

Conference

Conference: 8th International Symposium on Computing and Networking, CANDAR 2020
Country/Territory: Japan
City: Virtual, Naha
Period: 20/11/24 to 20/11/27

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Computer Science Applications
  • Software
