TY - GEN
T1 - Skewed checkpointing for tolerating multi-node failures
AU - Nakamura, Hiroshi
AU - Hayashida, Takuro
AU - Kondo, Masaaki
AU - Tajima, Yuya
AU - Imai, Masashi
AU - Nanya, Takashi
PY - 2004
Y1 - 2004
N2 - Large cluster systems have become widely utilized because they achieve a good performance/cost ratio especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpointing itself contains only one degree of redundancy, this skewed checkpointing ensures ⌊log₂ N⌋ degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
AB - Large cluster systems have become widely utilized because they achieve a good performance/cost ratio especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpointing itself contains only one degree of redundancy, this skewed checkpointing ensures ⌊log₂ N⌋ degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
UR - http://www.scopus.com/inward/record.url?scp=16244407130&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=16244407130&partnerID=8YFLogxK
U2 - 10.1109/RELDIS.2004.1353012
DO - 10.1109/RELDIS.2004.1353012
M3 - Conference contribution
AN - SCOPUS:16244407130
SN - 0769522394
T3 - Proceedings of the IEEE Symposium on Reliable Distributed Systems
SP - 116
EP - 125
BT - Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
T2 - 23rd IEEE International Symposium on Reliable Distributed Systems, SRDS 2004
Y2 - 18 October 2004 through 20 October 2004
ER -