Using fault injection to analyze the scope of error propagation in Linux

Takeshi Yoshimura, Hiroshi Yamada, Kenji Kono

Research output: Contribution to journal › Article

Abstract

Operating systems (OSes) are crucial for achieving high availability of computer systems. Even if applications running on an operating system are highly available, a bug inside the kernel may result in a failure of the entire software stack. The objective of this study is to gain insight into how to develop a Linux kernel that is more resilient against software faults. In particular, this paper investigates the scope of error propagation. The propagation scope is process-local if the erroneous value is not propagated outside the process context that activated it. The scope is kernel-global if the erroneous value is propagated outside the process context that activated it. Investigating the scope of error propagation gives us insight into (1) defensive coding style, (2) reboot-less rejuvenation, and (3) general recovery mechanisms of the Linux kernel. For example, if most errors are process-local, we can rejuvenate the kernel without reboots because the kernel can be recovered simply by killing faulty processes. To investigate the scope of error propagation, we conduct an experimental campaign of fault injection on Linux 2.6.18, using a kernel-level fault injector widely used in the OS community. Our findings are: (1) our target kernel (Linux 2.6.18) is coded defensively, and this defensive coding style contributes to lower rates of error manifestation and kernel-global errors; (2) the scope of error propagation is mostly process-local in Linux; and (3) global propagation occurs with low probability: even if an error corrupts a global data structure, other processes rarely access it.

Original language: English
Pages (from-to): 55-64
Number of pages: 10
Journal: IPSJ Online Transactions
Volume: 6
Issue number: 1
DOI: 10.2197/ipsjtrans.6.55
Publication status: Published - 2013

Fingerprint

Linux
Computer operating systems
Data structures
Computer systems
Availability
Recovery

Keywords

  • Error propagation
  • Fault injection
  • Rejuvenation
  • Software faults
  • System dependability

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Using fault injection to analyze the scope of error propagation in Linux. / Yoshimura, Takeshi; Yamada, Hiroshi; Kono, Kenji.

In: IPSJ Online Transactions, Vol. 6, No. 1, 2013, p. 55-64.

Yoshimura, Takeshi ; Yamada, Hiroshi ; Kono, Kenji. / Using fault injection to analyze the scope of error propagation in Linux. In: IPSJ Online Transactions. 2013 ; Vol. 6, No. 1. pp. 55-64.
@article{64bd13c9167548cfb90786a9fb3bca28,
title = "Using fault injection to analyze the scope of error propagation in {Linux}",
abstract = "Operating systems (OSes) are crucial for achieving high availability of computer systems. Even if applications running on an operating system are highly available, a bug inside the kernel may result in a failure of the entire software stack. The objective of this study is to gain insight into how to develop a Linux kernel that is more resilient against software faults. In particular, this paper investigates the scope of error propagation. The propagation scope is process-local if the erroneous value is not propagated outside the process context that activated it. The scope is kernel-global if the erroneous value is propagated outside the process context that activated it. Investigating the scope of error propagation gives us insight into (1) defensive coding style, (2) reboot-less rejuvenation, and (3) general recovery mechanisms of the Linux kernel. For example, if most errors are process-local, we can rejuvenate the kernel without reboots because the kernel can be recovered simply by killing faulty processes. To investigate the scope of error propagation, we conduct an experimental campaign of fault injection on Linux 2.6.18, using a kernel-level fault injector widely used in the OS community. Our findings are: (1) our target kernel (Linux 2.6.18) is coded defensively, and this defensive coding style contributes to lower rates of error manifestation and kernel-global errors; (2) the scope of error propagation is mostly process-local in Linux; and (3) global propagation occurs with low probability: even if an error corrupts a global data structure, other processes rarely access it.",
keywords = "Error propagation, Fault injection, Rejuvenation, Software faults, System dependability",
author = "Takeshi Yoshimura and Hiroshi Yamada and Kenji Kono",
year = "2013",
doi = "10.2197/ipsjtrans.6.55",
language = "English",
volume = "6",
pages = "55--64",
journal = "IPSJ Online Transactions",
issn = "1882-6660",
publisher = "Information Processing Society of Japan",
number = "1",

}

TY - JOUR

T1 - Using fault injection to analyze the scope of error propagation in Linux

AU - Yoshimura, Takeshi

AU - Yamada, Hiroshi

AU - Kono, Kenji

PY - 2013

Y1 - 2013

AB - Operating systems (OSes) are crucial for achieving high availability of computer systems. Even if applications running on an operating system are highly available, a bug inside the kernel may result in a failure of the entire software stack. The objective of this study is to gain insight into how to develop a Linux kernel that is more resilient against software faults. In particular, this paper investigates the scope of error propagation. The propagation scope is process-local if the erroneous value is not propagated outside the process context that activated it. The scope is kernel-global if the erroneous value is propagated outside the process context that activated it. Investigating the scope of error propagation gives us insight into (1) defensive coding style, (2) reboot-less rejuvenation, and (3) general recovery mechanisms of the Linux kernel. For example, if most errors are process-local, we can rejuvenate the kernel without reboots because the kernel can be recovered simply by killing faulty processes. To investigate the scope of error propagation, we conduct an experimental campaign of fault injection on Linux 2.6.18, using a kernel-level fault injector widely used in the OS community. Our findings are: (1) our target kernel (Linux 2.6.18) is coded defensively, and this defensive coding style contributes to lower rates of error manifestation and kernel-global errors; (2) the scope of error propagation is mostly process-local in Linux; and (3) global propagation occurs with low probability: even if an error corrupts a global data structure, other processes rarely access it.

KW - Error propagation

KW - Fault injection

KW - Rejuvenation

KW - Software faults

KW - System dependability

UR - http://www.scopus.com/inward/record.url?scp=84880601864&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880601864&partnerID=8YFLogxK

U2 - 10.2197/ipsjtrans.6.55

DO - 10.2197/ipsjtrans.6.55

M3 - Article

AN - SCOPUS:84880601864

VL - 6

SP - 55

EP - 64

JO - IPSJ Online Transactions

JF - IPSJ Online Transactions

SN - 1882-6660

IS - 1

ER -