FAWS: Fault-Aware Weight Scheduler for DNN Computations in Heterogeneous and Faulty Hardware

Shaswot Shresthamali, Yuan He, Masaaki Kondo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The idea of using inexact computation for overprovisioned DNNs (Deep Neural Networks) to decrease power and latency at the cost of minor accuracy degradation has become very popular. However, there is still no general method to schedule DNN computations on a given hardware platform to effectively implement this idea without loss in computational efficiency. Most contemporary methods require specialized hardware, extensive retraining, and hardware-specific scheduling schemes. We present FAWS: Fault-Aware Weight Scheduler for scheduling DNN computations in heterogeneous and faulty hardware. Given a trained DNN model and a hardware fault profile, our scheduler is able to recover significant accuracy during inference even at high fault rates. FAWS schedules the computations such that the low-priority ones are allocated to inexact hardware. This is achieved by shuffling (exchanging) the rows of the matrices. The best shuffling order for a given DNN model and hardware fault profile is determined using Genetic Algorithms (GA). We simulate bitwise errors on different model architectures and datasets with different types of fault profiles and observe that FAWS can recover up to 30% of classification accuracy even at high fault rates (which correspond to approximately 50% power savings).
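
The row-shuffling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a simplified fault model in which a faulty hardware lane returns 0 (the paper simulates bitwise errors), uses the squared output error on one sample input as the fitness, and replaces the paper's GA with a minimal mutation-only variant. The names `faulty_matvec`, `output_error`, `evolve_schedule`, and `faulty_lanes` are hypothetical.

```python
import random

def faulty_matvec(W, x, perm, faulty_lanes):
    """Compute y = W @ x where logical row i runs on physical lane perm[i].

    Assumed fault model: a faulty lane returns 0 instead of the dot product.
    """
    y = []
    for i, row in enumerate(W):
        value = sum(w * xi for w, xi in zip(row, x))
        y.append(0.0 if perm[i] in faulty_lanes else value)
    return y

def output_error(W, x, perm, faulty_lanes):
    """Squared error against the fault-free result (lower is better)."""
    exact = faulty_matvec(W, x, perm, set())
    got = faulty_matvec(W, x, perm, faulty_lanes)
    return sum((a - b) ** 2 for a, b in zip(exact, got))

def evolve_schedule(W, x, faulty_lanes, generations=200, pop_size=20, seed=0):
    """Tiny mutation-only GA over row-to-lane permutations (an assumption,
    not the GA configuration used in the paper)."""
    rng = random.Random(seed)
    n = len(W)

    def fitness(perm):
        return output_error(W, x, perm, faulty_lanes)

    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]   # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(n), 2)        # swap keeps a valid permutation
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

# Toy layer: rows 0 and 3 carry large weights, rows 1 and 2 are near zero.
W = [[5.0, -4.0], [0.01, 0.02], [-0.02, 0.01], [3.0, 6.0]]
x = [1.0, 1.0]
faulty_lanes = {0, 3}   # hypothetical fault profile: lanes 0 and 3 are broken

identity = list(range(len(W)))
best = evolve_schedule(W, x, faulty_lanes)
baseline_error = output_error(W, x, identity, faulty_lanes)
scheduled_error = output_error(W, x, best, faulty_lanes)
```

In this toy setting the unshuffled schedule places both high-magnitude rows on broken lanes, while the evolved permutation moves the near-zero rows there instead, so `scheduled_error` ends up far smaller than `baseline_error`. A real scheduler would score candidates by model accuracy on a validation set rather than by the error on a single input.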

Original language: English
Title of host publication: Proceedings - 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 204-212
Number of pages: 9
ISBN (Electronic): 9781665464970
Publication status: Published - 2022
Event: 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022 - Melbourne, Australia
Duration: 2022 Dec 17 → 2022 Dec 19

Conference

Conference: 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022
Country/Territory: Australia
City: Melbourne
Period: 22/12/17 → 22/12/19

Keywords

  • Deep Learning Accelerators
  • Fault Tolerance
  • GPU
  • Genetic Algorithms

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Decision Sciences (miscellaneous)
  • Information Systems and Management
  • Renewable Energy, Sustainability and the Environment
  • Communication
