Dynamic Circular Convolution for Image Classification
- Abstract
- In recent years, the Vision Transformer (ViT) has achieved outstanding results in disentangling the diverse information of visual inputs, superseding traditional Convolutional Neural Networks (CNNs). Although CNNs have strong inductive biases such as translation equivariance and relative position, they require deep stacks of layers to model long-range dependencies in the input data, which results in high model complexity. Compared to CNNs, ViT can extract global features even in early layers through token-to-token interactions, without considering the geometric locations of pixels. ViT models are therefore data-hungry; in other words, they learn data-dependent representations and reach high performance on large-scale datasets. Nonetheless, ViT has quadratic complexity in the input token length because of the dot product between the query and key matrices. Different from ViT- and CNN-based models, this paper proposes a Dynamic Circular Convolution Network (DCCNet) that learns token-to-token interactions in the Fourier domain, relaxing the model complexity from the O(N²) of ViTs to O(N log N), and treating the global Fourier filters as dynamic, input-dependent weights rather than the static, independent weights of conventional operators. The token features and the dynamic filters are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform (FFT). Dynamic circular convolution between the Fourier features and the transformed filters, in lieu of matrix multiplication, is then performed separably along the channel dimension, and the output is reverted to the spatial domain by the Inverse Fast Fourier Transform (IFFT). Extensive experiments are conducted and evaluated on the large-scale ImageNet1k dataset and the small CIFAR100 dataset. On ImageNet1k, the proposed model achieves 75.4% top-1 accuracy and 92.6% top-5 accuracy with a budget of 7.5M parameters under settings similar to those of ViT-based models, surpassing ViT and its variants. When fine-tuned on the smaller dataset, DCCNet still works well and achieves state-of-the-art performance. Evaluating the model on both large and small datasets verifies the effectiveness and generalization capability of the proposed method.
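The FFT → element-wise product → IFFT pipeline sketched in the abstract can be illustrated with a minimal PyTorch module. This is a hedged sketch of the general pattern, not the paper's exact design: the filter generator `filter_gen`, the tensor shapes, and the per-channel handling are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DynamicCircularConv(nn.Module):
    """Minimal sketch of FFT-based dynamic circular convolution over tokens.

    Assumes input x of shape (B, N, C): batch, N tokens, C channels.
    The dynamic-filter generator below is a hypothetical stand-in for
    however DCCNet actually conditions its filters on the input.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical dynamic-filter generator: predicts one spatial
        # filter value per token position and channel from the tokens.
        self.filter_gen = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Input-dependent (dynamic) filters, same shape as the tokens.
        w = self.filter_gen(x)                      # (B, N, C)
        # Real FFT along the token dimension: O(N log N) per channel,
        # instead of the O(N^2) token-to-token attention matrix.
        X = torch.fft.rfft(x, dim=1)                # (B, N//2+1, C)
        W = torch.fft.rfft(w, dim=1)                # (B, N//2+1, C)
        # Element-wise product in the Fourier domain is equivalent to
        # circular convolution in the token domain, applied separably
        # per channel (no mixing across the channel dimension).
        Y = X * W
        # Inverse FFT reverts the result to the spatial (token) domain.
        return torch.fft.irfft(Y, n=N, dim=1)       # (B, N, C)


# Usage example with assumed shapes: 196 tokens (14x14 patches), 64 channels.
x = torch.randn(2, 196, 64)
y = DynamicCircularConv(64)(x)
assert y.shape == x.shape
```

Under these assumptions, the two forward FFTs and the inverse FFT dominate the cost at O(N log N), which is where the claimed complexity reduction over the O(N²) attention matrix comes from.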
- Issued Date
- 2023
- Author
- Xuan-Thuy Vo; Duy-Linh Nguyen; Adri Priadana; Kang-Hyun Jo
- Type
- Article
- Keyword
- Vision Transformer; Dynamic Global Weights; Fourier Transform; Image Classification
- DOI
- 10.1007/978-981-99-4914-4_4
- URI
- https://oak.ulsan.ac.kr/handle/2021.oak/17245
- Publisher
- Communications in Computer and Information Science
- Language
- English
- ISSN
- 1865-0929
- Citation Volume
- 1857
- Citation Number
- 1
- Citation Start Page
- 43
- Citation End Page
- 55
Appears in Collections:
- Engineering > IT Convergence