Dynamic fault recovery in mesh‐connected parallel computers

Takashi Yokota, Hideharu Amano, Hideo Aiso

Research output: Contribution to journal › Article

Abstract

One trend in the development of high‐speed computers is highly parallel processing using a large number of processing elements, a scheme that is becoming more realistic with recent progress in VLSI technology. As the number of processing elements grows, however, so does the problem of coping with faults. Fault‐tolerant computers with multiple redundancy have been developed, but no method has been presented for the parallel computer environment that provides sufficient redundancy against faults, recovers from a fault, and continues the computation without bringing the system down. In general, a fault destroys the completeness of data. In numerical computation, however, some problems have less stringent requirements for completeness of data (e.g., the iterative solution of a system of equations). This paper discusses the case where such a problem is solved on a parallel computer with a lattice topology. Three types of structure are proposed for dynamic fault recovery during execution, together with their interconnection and the method of recovery. Evaluation results obtained by simulation are presented.
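The remark that iterative solvers tolerate incomplete data can be illustrated with a small sketch. The example below is not the authors' scheme: it is a plain Jacobi relaxation on a square lattice, written sequentially, in which the state of one cell is wiped at a chosen sweep to mimic a processing element whose data is lost and restarted from zero. The grid size, boundary values, fault position, and the names `jacobi_sweep`, `solve`, and `max_diff` are all assumptions made for illustration.

```python
def jacobi_sweep(grid):
    """One Jacobi sweep: every interior cell becomes the mean of its four lattice neighbours."""
    n = len(grid)
    new = [row[:] for row in grid]  # boundary rows/columns are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def solve(n=8, sweeps=400, fault_at=None, fault_cell=(3, 3)):
    """Relax an n x n lattice; optionally wipe one cell's value before sweep `fault_at`."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n  # assumed boundary condition: top edge held at 1, other edges at 0
    for step in range(sweeps):
        if step == fault_at:
            i, j = fault_cell
            grid[i][j] = 0.0  # hypothetical fault: local state lost, replacement starts from zero
        grid = jacobi_sweep(grid)
    return grid


def max_diff(a, b):
    """Largest absolute cell-wise difference between two grids."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


clean = solve()
recovered = solve(fault_at=50)   # fault early on, many sweeps left to absorb it
late = solve(fault_at=399)       # fault just before the final sweep

print(max_diff(clean, recovered))
print(max_diff(clean, late))
```

When the fault occurs early, the remaining sweeps pull the lost value back toward the same fixed point, so the result matches the fault-free run to within rounding. A fault just before the final sweep leaves a visible error, which suggests why recovery has to take place during execution rather than after it.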

Original language: English
Pages (from-to): 10-18
Number of pages: 9
Journal: Systems and Computers in Japan
Volume: 17
Issue number: 7
Publication status: Published - 1986

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture
  • Computational Theory and Mathematics
