Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, Martin Schulz

Research output: Conference contribution

21 citations (Scopus)

Abstract

Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.

Original language: English
Title of host publication: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 957-966
Number of pages: 10
ISBN (electronic): 9781538639146
Publication status: Published - 30 Jun 2017
Externally published: Yes
Event: 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017 - Orlando, United States
Duration: 29 May 2017 to 2 Jun 2017

Publication series

Name: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017

Conference

Conference: 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017
Country/Territory: United States
City: Orlando
Period: 29 May 2017 to 2 Jun 2017

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
