Skewed checkpointing for tolerating multi-node failures

Hiroshi Nakamura, Takuro Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, Takashi Nanya

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Large cluster systems have become widely utilized because they achieve a good performance/cost ratio, especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing method called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpoint itself contains only one degree of redundancy, skewed checkpointing ensures ⌊log₂ N⌋ degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of its performance overhead. The method is then applied to a cluster system and compared with conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.

Original language: English
Title of host publication: Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
Pages: 116-125
Number of pages: 10
Publication status: Published - 2004
Externally published: Yes
Event: 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004 - Florianopolis, Brazil
Duration: 2004 Oct 18 to 2004 Oct 20

Publication series

Name: Proceedings of the IEEE Symposium on Reliable Distributed Systems
ISSN (Print): 1060-9857

Conference

Conference: 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
Country/Territory: Brazil
City: Florianopolis
Period: 04/10/18 to 04/10/20

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
