Oral Sessions

Oral S12: AI Accelerators (I)

Aug. 6, 2020 11:10 AM - 12:10 PM

Room: 薔薇廳
Session chair: Prof. 呂仁碩 (Ren-Shuo Liu)

PTLL-BNN: Physically Tightly Coupled, Logically Loosely Coupled, Near-Memory BNN Accelerator
Presentation No.: O12-1    Time: 11:10 - 11:25
Paper No.: 0069
Yun-Chen Lo, Chih-Chen Yeh, Chia-Chun Wang, Yu-Chun Kuo, Yun-Sheng Chang, Jian-Hao Huang, Jun-Shen Wu, Wen-Chien Ting, Tai-Hsing Wen and Ren-Shuo Liu
Department of Electrical Engineering, National Tsing Hua University

In this paper, a physically tightly coupled, logically loosely coupled, near-memory binary neural network accelerator (PTLL-BNN) is designed and fabricated. Both architecture-level and circuit-level optimizations are presented. From the processor-architecture perspective, the PTLL-BNN embodies two new design choices. First, the proposed BNN accelerator is placed close to the SRAM of the embedded processors (i.e., physically tightly coupled and near-memory); thus, the extra SRAM cost incurred by the accelerator is as low as 0.5 KB. Second, the accelerator is a memory-mapped I/O (MMIO) device (i.e., logically loosely coupled), so any embedded processor can be equipped with the proposed accelerator without the burden of changing its compiler and pipeline. From the circuit perspective, this work employs four techniques to optimize the power and cost of the accelerator. First, the design adopts a unified input-kernel-output memory instead of the separate memories used in many previous works. Second, the chosen data layout increases the sequentiality of the SRAM accesses and reduces the buffer size needed to store intermediate values. Third, this work proposes fusing the max-pooling, batch-normalization, and binarization layers of the BNNs to significantly reduce hardware complexity. Finally, a novel methodology for generating the scheduler hardware of the accelerator is included. We fabricate the accelerator in TSMC 180 nm technology. Chip measurements reach 91 GOP/s on average (307 GOP/s at peak) at 200 MHz. The achieved GOP/s per million logic gates and GOP/s per KB of SRAM are 2.6 to 237 times greater than those of previous works. We also realize an FPGA system to demonstrate the recognition of CIFAR-10/100 images using the fabricated accelerator.
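
The layer-fusion step is only summarized above; as a point of reference, the following NumPy sketch (not taken from the paper) shows the standard way batch normalization plus sign binarization fold into a per-channel threshold compare, and why max-pooling can then operate on 1-bit data. It assumes the common BNN formulation with a positive batch-norm scale; all names and shapes are illustrative.

```python
import numpy as np

def fuse_bn_binarize(pre_act, gamma, beta, mu, var, eps=1e-5):
    """Fold batch-norm + sign binarization into one per-channel threshold
    compare on integer pre-activations of shape [C, H, W].
    Assumes gamma > 0, i.e., batch-norm is monotonically increasing."""
    # sign(gamma*(y - mu)/sqrt(var + eps) + beta) = +1  <=>  y >= thr
    thr = mu - beta * np.sqrt(var + eps) / gamma
    return np.where(pre_act >= thr[:, None, None], 1, -1).astype(np.int8)

def binary_maxpool2x2(b):
    """Max-pooling of {-1, +1} values is a logical OR over each 2x2 window,
    so pooling can run after binarization on 1-bit data."""
    C, H, W = b.shape
    return b.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

# Because batch-norm with gamma > 0 is monotonic, pooling commutes with the
# threshold compare: binarize(BN(maxpool(y))) == binary_maxpool2x2(fuse(y)).
```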


 
Design of a Compute in Memory Circuit Using NCFET
Presentation No.: O12-2    Time: 11:25 - 11:40
Paper No.: 0205
Chia-Heng Lee, Ying-Tuan Hsu, Tsung-Te Liu and Tzi-Dar Chiueh
Graduate Inst. of Electronics Engineering, National Taiwan University

In-memory computation is a new technique for accelerating computation in machine learning (ML). Since in-memory computation combines weight storage and computation, it can execute convolution in a highly parallel fashion. In this paper, we present the design of a new computation-in-memory circuit based on the negative capacitance field-effect transistor (NCFET). In addition, we show the advantage of this circuit by comparing its energy and execution time with those of a traditional CMOS-based computation-in-memory circuit.
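
The abstract gives no circuit details; what follows is only a behavioral sketch of the computational model that compute-in-memory accelerators exploit (weights resident in the array, inputs broadcast to all rows, every column accumulating its dot product in parallel), not the NCFET circuit described in the paper. All names and shapes are illustrative.

```python
import numpy as np

def cim_array_mac(inputs, weights):
    """One compute-in-memory access: the input vector drives all rows
    (wordlines) at once, and every column (bitline) accumulates its stored
    weights times the inputs, so a full vector-matrix product finishes in a
    single array access."""
    return inputs @ weights          # (rows,) x (rows, cols) -> (cols,)

def conv_via_cim(image, kernels, K):
    """A KxK convolution maps onto the array by unrolling each kernel into
    one column and streaming im2col patches as input vectors."""
    H, W = image.shape
    cols = np.stack([k.ravel() for k in kernels], axis=1)   # (K*K, n_kernels)
    out = np.empty((H - K + 1, W - K + 1, cols.shape[1]))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            patch = image[i:i + K, j:j + K].ravel()
            out[i, j] = cim_array_mac(patch, cols)           # one access
    return out
```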


 
Efficient Approximate Computing of Residue Number System for Neural Network Acceleration
Presentation No.: O12-3    Time: 11:40 - 11:55
Paper No.: 0098
Liang-Yu Lin1, Jerrae Schroff2, Tsu-Ping Lin1 and Tsung-Chu Huang1
1Department of Electronics Engineering, National Changhua University of Education, Changhua, Taiwan
2Cogito Academy, Orinda, California, USA


Residue number systems can simultaneously address several major concerns of neural networks, including acceleration, power consumption, area overhead, and fault tolerance. However, three major issues remain in the most recent work: (1) dynamic-range inflation for systems of all precisions, (2) the equivalently hard problems of sign detection, magnitude comparison, and overflow detection, and (3) long right shifts in CORDIC operations. In this paper, we propose a perfect deflation factor for the approximate Chinese remainder theorem that limits dynamic-range inflation in long datapaths, including typical neural networks. A systematic approach is then developed to automatically design low-power, compact, fast, and reliable neural networks. To our knowledge, this is the first paper to present efficient approximate computing for residue number systems that achieves more than 2.5 times acceleration without any dynamic-range inflation issue.
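
The paper's approximate Chinese remainder theorem and deflation factor are not reproduced here; for background only, the sketch below shows plain residue-number-system arithmetic with an exact CRT decode. The moduli and the small dot product are illustrative assumptions, not values from the paper.

```python
from math import prod

MODULI = (251, 253, 255, 256)        # example pairwise-coprime moduli

def to_rns(x, moduli=MODULI):
    """Encode a non-negative integer as one small residue per modulus."""
    return tuple(x % m for m in moduli)

def rns_mac(acc, a, b, moduli=MODULI):
    """Multiply-accumulate done independently per channel: no carries cross
    channels, so all channels run in parallel with short word lengths."""
    return tuple((r + x * y) % m for r, x, y, m in zip(acc, a, b, moduli))

def from_rns(res, moduli=MODULI):
    """Exact Chinese-remainder-theorem decode (the paper replaces this with
    an approximate, deflated variant to curb dynamic-range inflation)."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(res, moduli)) % M

# A small dot product computed entirely in the residue domain:
acc = to_rns(0)
for w, x in zip([3, 2, 5], [7, 4, 9]):
    acc = rns_mac(acc, to_rns(w), to_rns(x))
assert from_rns(acc) == 3 * 7 + 2 * 4 + 5 * 9   # 74
```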


 
High Performance Dilated and Transposed Convolution Accelerator with Decomposition
Presentation No.: O12-4    Time: 11:55 - 12:10
Paper No.: 0163
Kuo-Wei Chang and Tian-Sheuan Chang
Department of Electronics Engineering, National Chiao Tung University

* Abstract is not available.