TY - GEN
T1 - FAWS
T2 - 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022
AU - Shresthamali, Shaswot
AU - He, Yuan
AU - Kondo, Masaaki
N1 - Funding Information:
ACKNOWLEDGMENT This work was supported in part by the JST MIRAI Program Grant Number JPMJMI18E1.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The idea of using inexact computation for overprovisioned DNNs (Deep Neural Networks) to decrease power and la-tency at the cost of minor accuracy degradation has become very popular. However, there is still no general method to schedule DNN computations on a given hardware platform to effectively implement this idea without loss in computational efficiency. Most contemporary methods require specialized hardware, ex-tensive retraining and hardware-specific scheduling schemes. We present FAWS: Fault-Aware Weight Scheduler for scheduling DNN computations in heterogeneous and faulty hardware. Given a trained DNN model and a hardware fault profile, our scheduler is able to recover significant accuracy during inference even at high fault rates. FAWS schedules the computations such that the low priority ones are allocated to inexact hardware. This is achieved by shuffling (exchanging) the rows of the matrices. The best shuffling order for a given DNN model and hardware fault profile is determined using Genetic Algorithms (GA). We simulate bitwise errors on different model architectures and datasets with different types of fault profiles and observe that FAWS can recover up to 30% of classification accuracy even at high fault rates (which correspond to approximately 50 % power savings).
AB - The idea of using inexact computation for overprovisioned DNNs (Deep Neural Networks) to decrease power and la-tency at the cost of minor accuracy degradation has become very popular. However, there is still no general method to schedule DNN computations on a given hardware platform to effectively implement this idea without loss in computational efficiency. Most contemporary methods require specialized hardware, ex-tensive retraining and hardware-specific scheduling schemes. We present FAWS: Fault-Aware Weight Scheduler for scheduling DNN computations in heterogeneous and faulty hardware. Given a trained DNN model and a hardware fault profile, our scheduler is able to recover significant accuracy during inference even at high fault rates. FAWS schedules the computations such that the low priority ones are allocated to inexact hardware. This is achieved by shuffling (exchanging) the rows of the matrices. The best shuffling order for a given DNN model and hardware fault profile is determined using Genetic Algorithms (GA). We simulate bitwise errors on different model architectures and datasets with different types of fault profiles and observe that FAWS can recover up to 30% of classification accuracy even at high fault rates (which correspond to approximately 50 % power savings).
KW - Deep Leaning Accelerators
KW - Fault Tolerance
KW - GPU
KW - Genetic Algorithms
UR - http://www.scopus.com/inward/record.url?scp=85151979473&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85151979473&partnerID=8YFLogxK
U2 - 10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00033
DO - 10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00033
M3 - Conference contribution
AN - SCOPUS:85151979473
T3 - Proceedings - 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022
SP - 204
EP - 212
BT - Proceedings - 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 12th IEEE International Conference on Big Data and Cloud Computing, 12th IEEE International Conference on Sustainable Computing and Communications and 15th IEEE International Conference on Social Computing and Networking, ISPA/BDCloud/SocialCom/SustainCom 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 December 2022 through 19 December 2022
ER -