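Fault-tolerant design of deep learning accelerators based on recomputing

WANG Qianlong, XU Dawen

(School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei 230601, Anhui, China)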

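Abstract

Because of their high parallelism and simple communication, 2D computing arrays often handle the bulk of the convolution computation in a deep learning accelerator (DLA). A hardware fault in the array causes computation errors, which in turn sharply reduce prediction accuracy. To repair faults in the 2D computing array, this paper proposes a recomputing architecture (RCA) for fault-tolerant DLAs. Unlike the conventional immediate-repair strategy of adding redundancy inside the array, the RCA has a set of redundancy-based recomputing units (RCUs) that recompute the work of faulty units one-to-one in later cycles. Experimental results show that, compared with previous fault-tolerant schemes, the proposed method achieves higher fault-repair capability and scalability while occupying less chip area.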

Keywords

recomputing architecture (RCA); deep learning accelerator (DLA); fault tolerance; recomputing

CLC number: TP183

Document code: A

Article ID: 1003-5060(2023)01-0054-06

Fault-tolerant design of deep learning accelerator based on recomputing

WANG Qianlong, XU Dawen

(School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei 230601, China)

Abstract

Owing to their high parallelism and simple communication, the 2D computing arrays in a deep learning accelerator (DLA) often handle the bulk of the convolution computation. A hardware fault in the array causes computation errors, which can significantly reduce prediction accuracy. To repair faults in 2D computing arrays, this paper proposes a recomputing architecture (RCA) for fault-tolerant DLAs. Unlike the traditional strategy of adding redundancy inside the array for immediate fault repair, the RCA has a set of redundancy-based recomputing units (RCUs) that recompute the outputs of faulty units one-to-one in later cycles. Experimental results show that, compared with previous fault-tolerant schemes, the proposed method offers higher fault-repair capability and scalability while occupying less chip area.
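The recomputing idea summarized above can be illustrated with a small software model. The sketch below is only a toy under stated assumptions, not the hardware design of the paper: it assumes an output-stationary array in which each processing element (i, j) produces one dot product, a hypothetical fault model in which a faulty element outputs 0, and RCUs that each repair one faulty element per extra pass. The names `run_with_recompute`, `faulty`, and `num_rcus` are introduced here for illustration.

```python
def run_with_recompute(a, b, faulty, num_rcus):
    """Toy model of a 2D array computing a x b with RCU-based repair.

    a, b     : matrices as lists of lists
    faulty   : set of (row, col) positions of faulty processing elements
    num_rcus : number of recomputing units available (assumed >= 1)
    Returns the repaired output and the number of extra passes used.
    """
    if num_rcus < 1:
        raise ValueError("at least one RCU is required")

    rows, inner, cols = len(a), len(b), len(b[0])

    def mac(i, j):
        # The multiply-accumulate work assigned to element (i, j).
        return sum(a[i][k] * b[k][j] for k in range(inner))

    # First pass: healthy elements are correct; faulty ones output 0
    # (an assumed fault model chosen only for this illustration).
    out = [[0 if (i, j) in faulty else mac(i, j) for j in range(cols)]
           for i in range(rows)]

    # Later passes: each RCU recomputes one faulty element (one-to-one).
    pending = sorted(faulty)
    extra_passes = 0
    while pending:
        batch, pending = pending[:num_rcus], pending[num_rcus:]
        for i, j in batch:
            out[i][j] = mac(i, j)
        extra_passes += 1
    return out, extra_passes


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, passes = run_with_recompute(a, b, {(0, 1), (1, 0), (1, 1)}, num_rcus=2)
print(result, passes)
```

In this model, three faulty elements and two RCUs need two extra passes. Repair capacity here is bounded by the time budget for extra passes rather than by the number of spare elements placed inside the array, which is one way to read the scalability claim in the abstract.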

Keywords

recomputing architecture (RCA); deep learning accelerator (DLA); fault tolerance; recomputing

Received: 2021-06-23

Revised: 2021-11-21

Funding: National Natural Science Foundation of China (61834006)