Lightweight recovery from kernel failures using phase-based reboot

Kazuya Yamakita, Hiroshi Yamada, Kenji Kono

Research output: Contribution to journal › Article

Abstract

Although operating systems (OSes) are crucial to achieving high availability of computer systems, modern OSes are far from bug-free. Rebooting the OS is simple, powerful, and sometimes the only remedy for kernel failures. Once we accept reboot-based recovery as a fact of life, we should try to ensure that the downtime caused by reboots is as short as possible. This paper presents "phase-based" reboots that shorten the downtime caused by reboot-based recovery. The key idea is to divide a boot sequence into phases. The phase-based reboot reuses a system state in the previous boot if the next boot reproduces the same state. A prototype of the phase-based reboot was implemented on Xen 3.4.1 running para-virtualized Linux 2.6.18. Experiments with the prototype show that it successfully recovered from kernel transient failures inserted by a fault injector, and its downtime was 34.3% to 93.6% shorter than that of the normal reboot-based recovery.
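
The key idea above (split the boot sequence into phases and reuse a saved state when the next boot would reproduce it) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Phase` type, the `snapshot_valid` check, the function names, and every timing figure are hypothetical and chosen only for illustration. The sketch assumes that a snapshot taken at the end of a phase is usable only if that phase and all earlier ones would reproduce the same state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    name: str
    boot_cost_s: float                  # hypothetical time to run this phase from scratch
    snapshot_valid: Callable[[], bool]  # hypothetical check: would the next boot reproduce this state?


def pick_restore_point(phases: List[Phase]) -> int:
    """Return how many leading phases can be skipped by restoring a snapshot."""
    usable = 0
    for phase in phases:
        if not phase.snapshot_valid():
            break  # state diverges here, so later snapshots cannot be reused
        usable += 1
    return usable


def downtime(phases: List[Phase], skipped: int, restore_cost_s: float) -> float:
    """Downtime = cost of restoring a snapshot (if any) + cost of the remaining phases."""
    remaining = sum(p.boot_cost_s for p in phases[skipped:])
    return remaining + (restore_cost_s if skipped else 0.0)


def reduction_percent(normal_s: float, phased_s: float) -> float:
    """Relative downtime reduction compared with a normal reboot, in percent."""
    return 100.0 * (normal_s - phased_s) / normal_s


# Made-up example: the first two phases are reproducible, the last one is not.
phases = [
    Phase("kernel_init", 10.0, lambda: True),
    Phase("daemon_start", 20.0, lambda: True),
    Phase("app_start", 30.0, lambda: False),
]
skipped = pick_restore_point(phases)
normal_s = downtime(phases, 0, restore_cost_s=5.0)
phased_s = downtime(phases, skipped, restore_cost_s=5.0)
print(skipped, normal_s, phased_s, round(reduction_percent(normal_s, phased_s), 1))
```

With these invented numbers the sketch skips two phases and reduces downtime from 60 s to 35 s. The figures reported in the abstract (34.3% to 93.6%) come from the authors' Xen-based prototype and are unrelated to this toy calculation.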

Original language: English
Pages (from-to): 59-70
Number of pages: 12
Journal: IPSJ Online Transactions
Volume: 5
Issue number: 1
DOI: 10.2197/ipsjtrans.5.59
Publication status: Published - 2012

Fingerprint

Recovery
Computer systems
Availability
Experiments
Linux

Keywords

  • Operating system reliability
  • Reboot-based recovery
  • Virtualization

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Lightweight recovery from kernel failures using phase-based reboot. / Yamakita, Kazuya; Yamada, Hiroshi; Kono, Kenji.

In: IPSJ Online Transactions, Vol. 5, No. 1, 2012, p. 59-70.

@article{3bec3834d79149479900406cc5de9b97,
title = "Lightweight recovery from kernel failures using phase-based reboot",
abstract = "Although operating systems (OSes) are crucial to achieving high availability of computer systems, modern OSes are far from bug-free. Rebooting the OS is simple, powerful, and sometimes the only remedy for kernel failures. Once we accept reboot-based recovery as a fact of life, we should try to ensure that the downtime caused by reboots is as short as possible. This paper presents {"}phase-based{"} reboots that shorten the downtime caused by reboot-based recovery. The key idea is to divide a boot sequence into phases. The phase-based reboot reuses a system state in the previous boot if the next boot reproduces the same state. A prototype of the phase-based reboot was implemented on Xen 3.4.1 running para-virtualized Linux 2.6.18. Experiments with the prototype show that it successfully recovered from kernel transient failures inserted by a fault injector, and its downtime was 34.3{\%} to 93.6{\%} shorter than that of the normal reboot-based recovery.",
keywords = "Operating system reliability, Reboot-based recovery, Virtualization",
author = "Kazuya Yamakita and Hiroshi Yamada and Kenji Kono",
year = "2012",
doi = "10.2197/ipsjtrans.5.59",
language = "English",
volume = "5",
pages = "59--70",
journal = "IPSJ Online Transactions",
issn = "1882-6660",
publisher = "Information Processing Society of Japan",
number = "1"
}

TY - JOUR

T1 - Lightweight recovery from kernel failures using phase-based reboot

AU - Yamakita, Kazuya

AU - Yamada, Hiroshi

AU - Kono, Kenji

PY - 2012

Y1 - 2012

AB - Although operating systems (OSes) are crucial to achieving high availability of computer systems, modern OSes are far from bug-free. Rebooting the OS is simple, powerful, and sometimes the only remedy for kernel failures. Once we accept reboot-based recovery as a fact of life, we should try to ensure that the downtime caused by reboots is as short as possible. This paper presents "phase-based" reboots that shorten the downtime caused by reboot-based recovery. The key idea is to divide a boot sequence into phases. The phase-based reboot reuses a system state in the previous boot if the next boot reproduces the same state. A prototype of the phase-based reboot was implemented on Xen 3.4.1 running para-virtualized Linux 2.6.18. Experiments with the prototype show that it successfully recovered from kernel transient failures inserted by a fault injector, and its downtime was 34.3% to 93.6% shorter than that of the normal reboot-based recovery.

KW - Operating system reliability

KW - Reboot-based recovery

KW - Virtualization

UR - http://www.scopus.com/inward/record.url?scp=84862123809&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862123809&partnerID=8YFLogxK

U2 - 10.2197/ipsjtrans.5.59

DO - 10.2197/ipsjtrans.5.59

M3 - Article

AN - SCOPUS:84862123809

VL - 5

SP - 59

EP - 70

JO - IPSJ Online Transactions

JF - IPSJ Online Transactions

SN - 1882-6660

IS - 1

ER -