MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers

Contribution Seminar 2024. 4. 18. 11:20

arXiv 2021, cited 135 times

Introduction

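An upgraded version of the earlier MiniLM.

A limitation of MiniLM is that, although the teacher and student may have different numbers of layers, they must have the same number of attention heads.

MiniLMv2 makes distillation possible even when the number of attention heads differs.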

Proposed Method

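The concept is simple.

The per-head vectors used in multi-head attention are concatenated into one vector and then split into the desired number of relation heads.

(This is done for both the teacher and the student.)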
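The concatenate-then-split step can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function name `split_into_relation_heads`, the tensor layout `(heads, seq_len, head_dim)`, and the example sizes (a 16-head teacher with hidden size 1024, a 12-head student with hidden size 384) are assumptions made for the example.

```python
import numpy as np

def split_into_relation_heads(x, num_relation_heads):
    """Re-split per-head vectors into relation heads.

    x: array of shape (num_heads, seq_len, head_dim)
    returns: array of shape (num_relation_heads, seq_len, relation_head_dim)
    """
    num_heads, seq_len, head_dim = x.shape
    hidden = num_heads * head_dim
    if hidden % num_relation_heads != 0:
        raise ValueError("hidden size must be divisible by num_relation_heads")
    # Concatenate the heads: (heads, seq, dim) -> (seq, heads * dim)
    concat = x.transpose(1, 0, 2).reshape(seq_len, hidden)
    # Split again: (seq, hidden) -> (relation_heads, seq, hidden / relation_heads)
    rel_dim = hidden // num_relation_heads
    return concat.reshape(seq_len, num_relation_heads, rel_dim).transpose(1, 0, 2)

rng = np.random.default_rng(0)
teacher_q = rng.normal(size=(16, 8, 64))  # hypothetical teacher: 16 heads, 8 tokens
student_q = rng.normal(size=(12, 8, 32))  # hypothetical student: 12 heads, 8 tokens

teacher_rel = split_into_relation_heads(teacher_q, 64)
student_rel = split_into_relation_heads(student_q, 64)
print(teacher_rel.shape, student_rel.shape)
```

After the split, both models have the same number of relation heads even though their original head counts and hidden sizes differ. Only the per-head dimension differs, and that dimension disappears once the seq_len x seq_len relation matrices are formed.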

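Self-attention relations are then built for the teacher and the student separately.

The notable point is that, instead of the Q-K attention used in the original MiniLM, relation matrices are computed between vectors of the same type (Q-Q, K-K, V-V) and compared between teacher and student.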
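A rough sketch of the same-type relation comparison, assuming the inputs have already been split into relation heads with shape `(relation_heads, seq_len, dim)`. The helper names `relation` and `relation_kl`, the random inputs, and averaging the KL divergence over heads and positions are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relation(x):
    """Same-type relation: softmax(x x^T / sqrt(d)) per relation head.

    x: (relation_heads, seq_len, dim) -> (relation_heads, seq_len, seq_len)
    """
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1)

def relation_kl(teacher_x, student_x, eps=1e-12):
    """KL(teacher relation || student relation), averaged over heads and positions."""
    r_t = relation(teacher_x)
    r_s = relation(student_x)
    kl = (r_t * (np.log(r_t + eps) - np.log(r_s + eps))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
# Hypothetical Q, K, V tensors: 64 relation heads, 8 tokens
teacher_qkv = tuple(rng.normal(size=(64, 8, 16)) for _ in range(3))
student_qkv = tuple(rng.normal(size=(64, 8, 6)) for _ in range(3))

# Same-type terms only: Q-Q, K-K, V-V
loss = sum(relation_kl(t, s) for t, s in zip(teacher_qkv, student_qkv))
print(float(loss))
```

Because each relation matrix is seq_len x seq_len, the teacher and student matrices are directly comparable even though their relation-head dimensions (16 and 6 here) differ.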

The formulation covers every combination of Q, K, and V, but the authors report using only same-type pairs to keep the training cost down.
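For reference, the general objective can be written roughly as below. The notation is reconstructed from memory of the paper and simplified (teacher and student layer indices are omitted), so treat the symbols as assumptions: $A_1, A_2, A_3$ stand for the queries, keys, and values after re-splitting, $A_r$ is the number of relation heads, $d_r$ the relation head dimension, and $|x|$ the sequence length.

```latex
\mathcal{L} = \sum_{i=1}^{3} \sum_{j=1}^{3} \alpha_{ij} \, \mathcal{L}_{ij},
\qquad
\mathcal{L}_{ij} = \frac{1}{A_r \, |x|} \sum_{a=1}^{A_r} \sum_{t=1}^{|x|}
  D_{\mathrm{KL}}\!\left( R^{T}_{ij,a,t} \,\middle\|\, R^{S}_{ij,a,t} \right),
\qquad
R_{ij,a} = \operatorname{softmax}\!\left( \frac{A_{i,a} A_{j,a}^{\top}}{\sqrt{d_r}} \right)
```

Using only same-type pairs corresponds to setting $\alpha_{ij} = 1$ when $i = j$ and $\alpha_{ij} = 0$ otherwise, which leaves the Q-Q, K-K, and V-V terms.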

Experiment result

The base student model has 12 attention heads.

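48 or 64 relation heads are used (depending on the teacher).

For a large teacher, distilling from an upper-middle layer works better than distilling from the last layer.

Comparing only same-type relations is effective once training cost is taken into account.

Performance comparison by number of relation heads

Comparing all combinations gives better performance, but the training cost is too high.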
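To get a feel for the cost gap, the number of relation terms in each setting can simply be counted. This is only a toy enumeration and assumes that "all combinations" means every ordered pair of Q, K, and V.

```python
from itertools import product

types = ("Q", "K", "V")

# Same-type pairs only: Q-Q, K-K, V-V
same_type = [(a, b) for a, b in product(types, repeat=2) if a == b]
# Every ordered pair of Q, K, V
all_pairs = list(product(types, repeat=2))

print(len(same_type), len(all_pairs))  # 3 9
```

Each pair needs its own relation matrices and KL term for both teacher and student, so the full setting computes three times as many terms per training step.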
