Training data-efficient image transformers & distillation through attention

Contribution seminar 2024. 4. 18. 10:49

ICML 2021, cited 4,795 times

Introduction

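This is the well-known DeiT paper.

I have been looking into KD (knowledge distillation), so I thought it was worth summarizing the paper that serves as the base for that line of work.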

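DeiT has essentially the same architecture as ViT. It improves performance through additional tuning.

The original ViT also needs a large amount of data to perform well (unlike a CNN, it lacks built-in inductive biases such as locality).

The basic concept is to overcome this weakness through KD.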

Proposed method

How the additional tuning improves performance is covered in the experiments part.

First, the KD-related content.

The paper defines soft and hard distillation.

Soft distillation

On top of the usual cross entropy against the true label, a KL-divergence term compares the student's prediction with the teacher's prediction.
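The soft distillation objective described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the defaults `lam=0.5` and `tau=3.0` are illustrative placeholders, and the KL term is computed as KL(teacher || student), which is the direction commonly used in KD code.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot target."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_distillation_loss(student_logits, teacher_logits, label, lam=0.5, tau=3.0):
    """(1 - lam) * CE(student, label) + lam * tau^2 * KL(teacher_tau || student_tau).

    lam and tau are illustrative defaults (assumptions), not tuned values.
    """
    ce = cross_entropy(student_logits, label)
    teacher_probs = softmax(teacher_logits, tau)
    student_probs = softmax(student_logits, tau)
    kl = kl_divergence(teacher_probs, student_probs)
    return (1 - lam) * ce + lam * tau ** 2 * kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross entropy remains; any mismatch with the teacher adds a positive penalty.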

Hard-label distillation

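The label predicted by the teacher is used as a target, and the loss is computed with cross entropy.
(Augmentation can change how important the target class is in a given image, so the teacher's hard-label information may also be effective.)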
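Hard-label distillation can be sketched the same way. The sketch below assumes an equal 1/2 weighting between the true-label term and the teacher-label term; the function names are hypothetical and the logits are toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]


def cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot target."""
    return -log_softmax(logits)[label]


def teacher_hard_label(teacher_logits):
    """The teacher's hard decision: the index of its largest logit."""
    return max(range(len(teacher_logits)), key=teacher_logits.__getitem__)


def hard_distillation_loss(student_logits, teacher_logits, label):
    """0.5 * CE(student, true label) + 0.5 * CE(student, teacher's argmax label)."""
    y_teacher = teacher_hard_label(teacher_logits)
    return (0.5 * cross_entropy(student_logits, label)
            + 0.5 * cross_entropy(student_logits, y_teacher))
```

If the teacher agrees with the true label, the loss reduces to the ordinary cross entropy. If it disagrees (for example because a crop removed the labeled object), the student is pulled toward both targets.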

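Here the hard labels are not used as-is; label smoothing converts them into soft labels (as far as I can tell from the paper, this is applied to the true labels with ε = 0.1, and the teacher's pseudo-labels are not smoothed).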
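Label smoothing itself is simple enough to sketch: the labeled class gets probability 1 - ε and the remaining ε is shared equally among the other classes. The function names below are hypothetical, and `epsilon=0.1` is used as the default on the assumption that it matches the paper's setting.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Turn a hard label into a soft target: 1 - epsilon on the labeled class,
    epsilon split evenly over the other classes."""
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == label else off_value for k in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]


def soft_target_cross_entropy(logits, target_probs):
    """Cross entropy against a full target distribution instead of a one-hot label."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))
```

With `epsilon=0.0` the soft target is one-hot again, so `soft_target_cross_entropy` falls back to the ordinary cross entropy.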

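Beyond simply changing the loss, the paper also proposes a KD scheme specific to transformers.

A separate distillation token is added and used for KD.

As shown in the figure, each token is trained with its own loss (the class token against the true label, the distillation token against the teacher), and predicting with them gives better performance than before.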
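The two-token setup can be sketched without a full transformer. Everything here is illustrative: the helper names are hypothetical, the token order (class, distillation, then patches) is an assumption, the logits stand in for the outputs of two linear heads, and the test-time fusion shown is adding the two heads' softmax outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot target."""
    return -math.log(softmax(logits)[label])


def build_token_sequence(cls_token, dist_token, patch_tokens):
    """Input sequence: class token, distillation token, then patch tokens
    (this ordering is an assumption of the sketch)."""
    return [cls_token, dist_token] + list(patch_tokens)


def two_head_loss(cls_logits, dist_logits, true_label, teacher_label):
    """Class head is trained on the true label, distillation head on the teacher's label."""
    return (0.5 * cross_entropy(cls_logits, true_label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))


def fused_prediction(cls_logits, dist_logits):
    """Late fusion: add the two heads' softmax outputs and take the argmax."""
    fused = [a + b for a, b in zip(softmax(cls_logits), softmax(dist_logits))]
    return max(range(len(fused)), key=fused.__getitem__)
```

Both tokens pass through the same transformer layers and interact with the patch tokens through attention; only the targets of their heads differ.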

Experiment result

Differences between DeiT and ViT

Ablation study on DeiT

Pre-training at image size 224, then fine-tuning at 384

Architecture and speed comparison across DeiT models
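Changing the resolution changes the number of patch tokens, so the learned position embeddings have to be resized before fine-tuning. The sketch below assumes a patch size of 16 (not stated in this post), treats each grid cell as a single scalar instead of a vector, and uses bilinear interpolation as a simple stand-in; it is not claimed to be the paper's exact procedure.

```python
def num_patch_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image (patch_size=16 is an assumption)."""
    side = image_size // patch_size
    return side * side


def resize_grid(grid, new_size):
    """Bilinearly resize a square 2D grid of floats to new_size x new_size,
    mapping corner cells to corner cells."""
    old = len(grid)
    if old < 2 or new_size < 2:
        raise ValueError("grids must be at least 2x2")
    scale = (old - 1) / (new_size - 1)
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = int(y)
        y1 = min(y0 + 1, old - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = int(x)
            x1 = min(x0 + 1, old - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# 224 / 16 = 14 patches per side, 384 / 16 = 24 patches per side.
old_side = 224 // 16
new_side = 384 // 16
toy_pos_embed = [[float(i + j) for j in range(old_side)] for i in range(old_side)]
resized = resize_grid(toy_pos_embed, new_side)
```

Under the patch-size assumption, the sequence grows from 196 to 576 patch tokens, which is why the position embeddings cannot be reused unchanged.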

As the teacher, a convnet improves performance more than a transformer does
(because it passes on information about its inductive bias)

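Performance comparison across KD methods; performance is better when the distillation token is used

When KD is done with the distillation token, the student's decisions come out more similar to the convnet (teacher), which suggests the teacher's inductive bias is transferred more effectively.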
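One simple way to quantify "more similar to the teacher" is the fraction of samples on which two models predict different classes. The sketch below uses a hypothetical helper name and made-up toy predictions, not numbers from the paper.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of samples on which two models predict different classes."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    differing = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return differing / len(preds_a)


# Toy predictions on five images (made up for illustration only).
convnet_teacher = [0, 1, 2, 2, 1]
plain_student = [0, 2, 2, 1, 1]
distilled_student = [0, 1, 2, 1, 1]

plain_gap = disagreement_rate(convnet_teacher, plain_student)
distilled_gap = disagreement_rate(convnet_teacher, distilled_student)
```

A lower disagreement rate with the convnet for the distilled student than for the plain one is the kind of evidence that points to the teacher's bias being transferred.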

Overall performance comparison

Performance comparison with and without distillation, and across numbers of epochs.
