Perceiver: General Perception with Iterative Attention

Contribution 세미나

Perceiver: General Perception with Iterative Attention

PaperGPT 2024. 4. 5. 15:40

2021, 453회 인용

하나의 transformer 모델로 다양한 modality를 고려

In this paper we introduce the Perceiver, a model designed to handle arbitrary configurations of different modalities using a single Transformer-based architecture.

기존 transformer의 경우 입력 크기에 제곱해서 연산량이 증가, 따라서 이미지도 grid 형태로 축소해서 많이 사용함

여기서는 이미지 pixel를 그대로 활용할 수 있음

결국 latent array를 업데이트 하는 방식 활용 (DETR에서 learnable query 사용하는 것과 비슷)

상대적으로 dimension이 적은 latent array를 활용 함으로써 latent transformer을 좀 더 깊게 세팅 할 수 있음

입력 크기와 모델 depth를 decoupling 시키자 → O(MN + LN^2) (예시로 48개의 latent transformer block 사용)

여기서 발생하는 문제: 초기에 dimension을 줄여버리면 네트워크가 입력 정보의 필요한 정보를 전부 커버하지 못하지 않을까?

→ 중간마다 입력정보를 cross attention

또한 Cross attention과 latent transformer의 weight는 다음 반복하는 부분에서 sharing
-> parameter 개수 줄이는 효과 + overfitting 방지