
Dynamic Circular Convolution for Image Classification

Abstract
In recent years, the Vision Transformer (ViT) has achieved an outstanding landmark in disentangling diverse information from visual inputs, superseding traditional Convolutional Neural Networks (CNNs). Although CNNs have strong inductive biases such as translation equivariance and relative positions, they require deep layers to model long-range dependencies in the input data, which results in high model complexity. Compared to CNNs, ViT can extract global features even in its earlier layers through token-to-token interactions, without considering the geometric location of pixels. Consequently, ViT models are data-dependent and data-hungry; in other words, they learn data-dependent representations and achieve high performance on large-scale datasets. Nonetheless, ViT has quadratic complexity in the length of the input token sequence because of the inherent dot product between the query and key matrices. Unlike ViT- and CNN-based models, this paper proposes a Dynamic Circular Convolution Network (DCCNet) that learns token-to-token interactions in the Fourier domain, relaxing the model complexity to O(N log N) from the O(N²) of ViTs; moreover, the global Fourier filters are treated as data-dependent and dynamic rather than as the independent, static weights of conventional operators. The token features and the dynamic filters are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform (FFT). Dynamic circular convolution, in lieu of matrix multiplication in the Fourier domain, is performed between the Fourier features and the transformed filters in a separable way along the channel dimension. The output of the circular convolution is then transformed back to the spatial domain by the Inverse Fast Fourier Transform (IFFT). Extensive experiments are conducted and evaluated on the large-scale ImageNet1k dataset and the small CIFAR100 dataset. On ImageNet1k, the proposed model achieves 75.4% top-1 accuracy and 92.6% top-5 accuracy with a budget of 7.5M parameters under settings similar to those of ViT-based models, surpassing ViT and its variants. When fine-tuned on the smaller dataset, DCCNet still works well and attains state-of-the-art performance. Evaluating the model on both large and small datasets verifies the effectiveness and generalization capabilities of the proposed method.
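The sketch below illustrates the Fourier-domain token mixing the abstract describes: tokens are mapped to the frequency domain with an FFT, multiplied elementwise per channel by a global filter (circular convolution in the spatial domain equals elementwise multiplication in the Fourier domain, which yields the O(N log N) cost), and mapped back with the inverse FFT. This is a minimal, hypothetical PyTorch illustration, not the authors' code: the class and parameter names are invented, and DCCNet's dynamic (input-dependent) filter generation is simplified here to a static learnable filter for brevity.

import torch
import torch.nn as nn


class GlobalFourierMixing(nn.Module):
    """Mix N tokens per channel in O(N log N) via FFT-based circular convolution.

    Hypothetical sketch of the idea in the abstract; DCCNet additionally
    generates the filter dynamically from the input.
    """

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One complex-valued global filter per channel, stored as real/imag
        # parts; rfft keeps only num_tokens // 2 + 1 frequency bins.
        self.filter = nn.Parameter(torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N tokens, dim channels)
        x_freq = torch.fft.rfft(x, dim=1)                   # FFT along the token axis
        w = torch.view_as_complex(self.filter)              # (N//2 + 1, dim)
        x_freq = x_freq * w                                 # separable per channel
        return torch.fft.irfft(x_freq, n=x.size(1), dim=1)  # back to spatial domain


# Usage: mix 196 tokens (a 14x14 patch grid) with 64 channels.
tokens = torch.randn(2, 196, 64)
mixer = GlobalFourierMixing(num_tokens=196, dim=64)
out = mixer(tokens)
print(out.shape)  # torch.Size([2, 196, 64])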
Issued Date
2023
Xuan-Thuy Vo
Duy-Linh Nguyen
Adri Priadana
Kang-Hyun Jo
Type
Article
Keyword
Vision Transformer; Dynamic Global Weights; Fourier Transform; Image Classification
DOI
10.1007/978-981-99-4914-4_4
URI
https://oak.ulsan.ac.kr/handle/2021.oak/17245
Publisher
Communications in Computer and Information Science
Language
English
ISSN
1865-0929
Citation Volume
1857
Citation Number
1
Citation Start Page
43
Citation End Page
55
Appears in Collections:
Engineering > IT Convergence
Access and License
  • Access status: Open
File List
  • No associated files are available.

Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.