Journal of Hefei University of Technology (Natural Science)

基于重计算的深度学习加速器容错设计

Fault-tolerant design of deep learning accelerator based on recomputing

Journal information

Journal of Hefei University of Technology (Natural Science), January 2023, Vol. 46, No. 1: 54-59

DOI: 10.3969/j.issn.1003-5060.2023.01.009

Authors

WANG Qianlong (王乾龙), XU Dawen (许达文)

(School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei 230601, China)

Abstract and Keywords

Abstract: Because of their high parallelism and simple communication, the 2D computing arrays in deep learning accelerators (DLAs) often handle the bulk of the convolution computation. A hardware fault in such an array causes computation errors, which in turn sharply reduce prediction accuracy. To repair faults in 2D computing arrays, this paper proposes a recomputing architecture (RCA) for fault-tolerant DLAs. Unlike the traditional strategy of repairing faults immediately by adding redundancy inside the array, the RCA provides a set of redundancy-based recomputing units (RCUs) that redo the computations of faulty units one-to-one in later cycles. Experimental results show that, compared with previous fault-tolerant schemes, the proposed method offers higher fault repair capability and scalability while occupying less chip area.

Keywords: recomputing architecture (RCA); deep learning accelerator (DLA); fault tolerance; recomputing
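The recomputing idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the authors' hardware design: the abstract gives no implementation details, so the output-stationary mapping (one processing element per output value), the modelling of a fault as a zero output, and the names `pe_mac`, `run_array`, `recompute`, and `n_rcu` are all hypothetical.

```python
def pe_mac(a_row, b_col):
    """Fault-free multiply-accumulate done by one processing element (PE)."""
    return sum(x * y for x, y in zip(a_row, b_col))


def run_array(A, B, faulty):
    """Toy 2D array: PE (i, j) produces output C[i][j] of the product A x B.

    Assumption: a PE listed in `faulty` returns a corrupted value,
    modelled here simply as 0.
    """
    cols = list(zip(*B))
    return [[0 if (i, j) in faulty else pe_mac(A[i], cols[j])
             for j in range(len(cols))]
            for i in range(len(A))]


def recompute(A, B, C, faulty, n_rcu):
    """Repair C in place after the main pass.

    In each later pass, every recomputing unit (RCU) redoes the work of
    exactly one faulty PE. Returns the number of extra passes used.
    """
    cols = list(zip(*B))
    pending = sorted(faulty)
    passes = 0
    while pending:
        batch, pending = pending[:n_rcu], pending[n_rcu:]
        for i, j in batch:
            C[i][j] = pe_mac(A[i], cols[j])
        passes += 1
    return passes


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    faulty = {(0, 1), (1, 0)}          # hypothetical fault map
    C = run_array(A, B, faulty)        # two outputs are corrupted
    extra = recompute(A, B, C, faulty, n_rcu=1)
    print(C, extra)
```

In this model the repair cost grows with the number of faulty units divided by the number of RCUs, which is one way to read the trade-off between later-cycle recomputation and in-array redundancy that the abstract describes.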

Funding

Supported by the National Natural Science Foundation of China (61834006)
