Book Group Author: IEEE
Source:
Page: 18-23
DOI: 10.1109/ICCCAS62034.2024.10651591
Published: 2024
Indexed: 2024-12-11
Document Type: Proceedings Paper
Conference
Meeting: 13th IEEE International Conference on Communications, Circuits and Systems (ICCCAS)
Location: Xiamen, PEOPLES R CHINA
Date: MAY 10-12, 2024
Sponsor: IEEE
Abstract
Recently, transformer models have been widely deployed in natural language processing and image processing. However, their superior performance comes with a large number of parameters and computations, which makes it difficult to deploy transformer models on resource-limited devices. To reduce the computation cost of transformer models, this paper proposes an improved network pruning method. In the proposed method, the parameter matrix is decomposed into blocks of a specific size, and pruning is then applied to each block so that the number of parameters remaining in each block is the same. To further reduce the memory required for the parameters, an efficient memory storage pattern for the sparse parameters is also proposed. Finally, by combining the proposed methods, an energy-efficient transformer accelerator architecture is proposed. The accelerator is implemented on FPGA devices, and implementation results show that the proposed design significantly improves speed and energy efficiency compared with previous designs.
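The abstract describes the block-balanced pruning and the fixed-size sparse storage it enables only at a high level. The following Python/NumPy sketch is a hypothetical illustration of that general idea, not the authors' implementation: the square block shape, the magnitude-based selection rule, and the (value, index) packing format are all assumptions.

```python
import numpy as np

def block_balanced_prune(weight, block_size, keep_per_block):
    """Keep only the keep_per_block largest-magnitude entries in each
    block_size x block_size block of `weight`, zeroing the rest.
    (Illustrative assumption: square blocks, magnitude criterion.)"""
    rows, cols = weight.shape
    assert rows % block_size == 0 and cols % block_size == 0
    pruned = np.zeros_like(weight)
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            block = weight[r:r + block_size, c:c + block_size]
            flat = np.abs(block).ravel()
            keep = np.argpartition(flat, -keep_per_block)[-keep_per_block:]
            mask = np.zeros(flat.size, dtype=bool)
            mask[keep] = True
            pruned[r:r + block_size, c:c + block_size] = block * mask.reshape(block.shape)
    return pruned

def pack_balanced_blocks(pruned, block_size, keep_per_block):
    """Pack each pruned block into fixed-length value and index arrays.
    Because every block retains the same number of entries, no per-block
    length field is needed. (Hypothetical storage format.)"""
    rows, cols = pruned.shape
    values, indices = [], []
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            flat = pruned[r:r + block_size, c:c + block_size].ravel()
            keep = np.sort(np.argpartition(np.abs(flat), -keep_per_block)[-keep_per_block:])
            values.append(flat[keep])
            indices.append(keep.astype(np.uint16))
    return np.stack(values), np.stack(indices)
```

Since every block keeps the same number of weights, the packed value and index arrays have identical length for all blocks, which suggests the regular, fixed-stride memory layout that a hardware accelerator could exploit.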