Skewed checkpointing for tolerating multi-node failures

Hiroshi Nakamura, Takuro Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, Takashi Nanya

Research output: Conference contribution

2 Citations (Scopus)

Abstract

Large cluster systems have become widely utilized because they achieve a good performance/cost ratio, especially in high-performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes grows, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing scheme called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpoint itself contains only one degree of redundancy, skewed checkpointing ensures [log_2 N] degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of its performance overhead. The method is then applied to a cluster system and compared with conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.

Original language: English
Title of host publication: Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
Pages: 116-125
Number of pages: 10
DOI
Publication status: Published - 2004
Externally published: Yes
Event: 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004 - Florianopolis, Brazil
Duration: 18 Oct 2004 to 20 Oct 2004

Publication series

Name: Proceedings of the IEEE Symposium on Reliable Distributed Systems
ISSN (Print): 1060-9857

Conference

Conference: 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
Country/Territory: Brazil
City: Florianopolis
Period: 18 Oct 2004 to 20 Oct 2004

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
