Book Group Author: IEEE
Source:
Page: 7-12
DOI: 10.1109/ICCCAS62034.2024.10652850
Published: 2024
Indexed: 2024-12-11
Document Type: Proceedings Paper
Conference
Meeting: 13th IEEE International Conference on Communications, Circuits and Systems (ICCCAS)
Location: Xiamen, PEOPLES R CHINA
Date: MAY 10-12, 2024
Sponsor: IEEE
Abstract
In recent years, transformer models have shown great potential in many application fields, including natural language processing and computer vision. In pursuit of better performance, transformer model sizes have grown ever larger, which leads to a significant increase in computational complexity and memory intensity. All these factors make transformer models challenging to deploy on edge devices. To reduce the computation cost of transformer models, this paper proposes an FPGA-based dynamic sparse attention accelerator for transformer processing at the edge. The proposed accelerator is designed based on the fact that not all tokens in self-attention are equally important. Accordingly, a low-precision prediction method using a dynamic mask is proposed to remove sparse tokens in the self-attention module. In addition, an efficient Softmax computation method is proposed to reduce the memory requirement during Softmax computation. The proposed accelerator is implemented on a Xilinx Zynq UltraScale+ device and achieves improved speed and energy efficiency compared with previous designs.
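
To illustrate the general idea described in the abstract, the sketch below shows dynamic sparse attention with a low-precision prediction pass: approximate attention scores are computed from quantized Q and K, a dynamic mask keeps only the highest-scoring tokens per query, and full-precision attention runs only on the kept tokens. This is a minimal software sketch, not the paper's hardware design; the 4-bit quantization, the keep_ratio parameter, the top-k masking rule, and the standard numerically stable Softmax are assumptions, since the abstract does not specify the prediction precision, thresholding scheme, or the proposed memory-efficient Softmax method.

import numpy as np

def quantize_int4(x):
    # Uniform symmetric 4-bit quantization, used only for cheap score prediction (assumed scheme).
    scale = np.max(np.abs(x)) / 7.0 + 1e-8
    return np.clip(np.round(x / scale), -8, 7), scale

def dynamic_sparse_attention(Q, K, V, keep_ratio=0.25):
    # Illustrative dynamic sparse attention: predict important tokens at low precision,
    # then compute full-precision attention only where the dynamic mask is set.
    n, d = Q.shape
    # 1. Low-precision prediction of attention scores.
    Qq, sq = quantize_int4(Q)
    Kq, sk = quantize_int4(K)
    approx_scores = (Qq @ Kq.T) * (sq * sk)          # cheap estimate of Q K^T
    # 2. Dynamic mask: keep the top keep_ratio tokens per query row (assumed rule).
    k = max(1, int(keep_ratio * n))
    thresh = np.partition(approx_scores, -k, axis=1)[:, -k][:, None]
    mask = approx_scores >= thresh
    # 3. Full-precision attention restricted to unmasked tokens.
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    # Standard numerically stable Softmax (stand-in for the paper's efficient Softmax).
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

# Example usage with random data.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 16, 64))
out = dynamic_sparse_attention(Q, K, V)
print(out.shape)  # (16, 64)

In this sketch, the quantized prediction pass is far cheaper than the full-precision score computation, so the savings come from skipping the exponentiation and weighted sum for the masked-out tokens, which mirrors the abstract's motivation of reducing computation and memory traffic for edge deployment.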