
Efficient Vision Transformers with Multi-Scale and Partial Attentions for Object Recognition

Abstract
Recently, Vision Transformers have become dominant methods for processing visual data, achieving promising performance on image classification, object detection, segmentation, and multimodal foundation models. As the key component of the Transformer, self-attention is highly flexible in capturing long-range dependencies and generalizes well; modeling global token-to-token interactions in an input-adaptive manner defines a new paradigm for feature extraction. Despite this general modeling capability and its scalability to model and data size, global self-attention has quadratic complexity in the token length and weak inductive biases such as locality and relative positions between tokens. When vision Transformers are transferred to downstream tasks, the models therefore incur a huge computational cost, and deploying the original Transformer models on real-world platforms results in high latency and energy consumption. This motivates us to develop efficient vision Transformers for object recognition that improve the efficiency of the Transformer and augment its inductive biases. This research has three aims: (Aim 1) integrating self-attention layers into earlier stages of hierarchical backbone networks, (Aim 2) exchanging information across non-overlapped window self-attentions, and (Aim 3) identifying the computation redundancy of sparse attention and proposing partial attention, which learns spatial interactions more efficiently.

Aim 1 is presented in Chapter 3, entitled Efficient Multi-scale Spatial Interactions (EMSNet). The EMSNet takes advantage of a hybrid network that adopts the merits of convolution and self-attention operations in a hierarchical network to achieve better visual representation. Each stage of the EMSNet efficiently models both short-range and long-range spatial interactions via the design of multi-scale tokens. In each block, an efficient combination of depthwise convolution, coordinate depthwise convolution, C-MHSA, and global multi-head self-attention (G-MHSA) is performed via a channel-splitting strategy, extracting a wide range of frequencies and multi-order interactions.

Aim 2 is presented in Chapter 4, entitled Exchange Information across Non-overlapped Local Self-Attentions via Mixing Abstract Tokens, called the MAT Transformer. This method enlarges the receptive fields and modeling capability of local self-attention by efficiently exchanging information across non-overlapped windows via Mixing Abstract Tokens (MAT). The intuitive idea of the MAT Transformer is to attach learnable abstract tokens to windows: via self-attention, each abstract token learns abstract information from its corresponding window, and mixing all abstract tokens with a Transformer encoder then exchanges information between local windows and yields global context modeling.

Aim 3 is presented in Chapter 5, entitled Efficient Vision Transformers with Partial Attention, named PartialFormer. In this chapter, we find that attention weights are highly similar to one another, which incurs computation redundancy. To address this issue, this research proposes a novel attention mechanism, called partial attention, that significantly reduces the computation redundancy in multi-head self-attention (MSA) and enhances the diversity of attention heads: each query in our attention interacts only with a small set of relevant tokens (a minimal sketch of this idea is given after the abstract). Extensive experiments are conducted and evaluated on various tasks such as image classification, object detection, and object segmentation.
As a result, our methods, EMSNet, the MAT Transformer, and PartialFormer, achieve promising performance across tasks. For example, MAT-2 achieves 79.0% Top-1 accuracy on ImageNet-1K, outperforming PVTv2-B0 by 8.5% at similar latency on a CPU device, and MAT-4 surpasses Swin-T by 1.8% mIoU with only 70% of the GFLOPs.
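The following Python/PyTorch snippet is a minimal sketch of the partial-attention idea referenced in Aim 3, under the assumption that "a small set of relevant tokens" is chosen as the top-k highest-scoring keys for each query; the function name partial_attention, the parameter top_k, and this selection rule are illustrative assumptions, not the dissertation's exact PartialFormer design. The sketch still materializes the full score matrix, so it only illustrates the sparse interaction pattern, not the efficiency gains of the actual method.

```python
import torch

def partial_attention(q, k, v, top_k=16):
    """Illustrative partial attention: each query attends only to its
    top_k highest-scoring keys instead of all tokens (sketch only).
    q, k, v: (batch, heads, tokens, head_dim)."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale         # (B, H, N, N) similarity scores
    top_k = min(top_k, scores.shape[-1])
    top_scores, top_idx = scores.topk(top_k, dim=-1)   # keep the k most relevant keys per query
    attn = top_scores.softmax(dim=-1)                  # normalize over the kept keys only
    # Gather the value vectors of the selected keys: (B, H, N, top_k, head_dim)
    v_exp = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, v.shape[-1])
    v_sel = torch.gather(v_exp, 3, idx)
    # Weighted sum over only the selected values: (B, H, N, head_dim)
    return (attn.unsqueeze(-2) @ v_sel).squeeze(-2)

# Toy usage: 2 images, 4 heads, 196 tokens, 32-dim heads
q = torch.randn(2, 4, 196, 32)
out = partial_attention(q, torch.randn_like(q), torch.randn_like(q), top_k=16)
print(out.shape)  # torch.Size([2, 4, 196, 32])
```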
Author(s)
보 수언 투이
Issued Date
2024
Awarded Date
2024-02
Type
Dissertation
URI
https://oak.ulsan.ac.kr/handle/2021.oak/13173
http://ulsan.dcollection.net/common/orgView/200000728424
Affiliation
University of Ulsan
Department
Graduate School, Department of Electrical, Electronic and Computer Engineering
Advisor
조강현
Degree
Doctor
Publisher
University of Ulsan, Graduate School, Department of Electrical, Electronic and Computer Engineering
Language
eng
Rights
Theses of the University of Ulsan are protected by copyright.
Appears in Collections:
Computer Engineering & Information Technology > 2. Theses (Ph.D)
Disclosure and License
  • Disclosure type: Open
File List
