Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture

Contribution Seminar 2024. 4. 17. 09:53

CVPR 2022, cited 61 times

This paper is the background to the background of the paper I actually want to read

Proposes a general purpose vision (GPV) system

input: image, text

output: bounding boxes and/or text

Goal: build a task-agnostic model

task: classification, localization, visual question answering, captioning

No separate head is needed for each task

Effective on zero-shot referring expressions

Effective at transferring new concepts across skills
(e.g., the person class is trained only on VQA, yet person can still be detected in the localization task)

Existing models have an independent head for each task

Here, the model seems to be trained so that it can infer the task from the text input
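To make the "task as text input" idea concrete, here is a minimal interface sketch. It is not the paper's code: `GPVOutput`, `build_query`, `gpv_stub`, and the prompt templates are all hypothetical names and wordings made up for illustration. The only point is that every task goes through the same call signature, with no task-specific head.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class GPVOutput:
    """One output structure shared by all tasks (assumed layout)."""
    boxes: List[Box] = field(default_factory=list)
    relevance: List[float] = field(default_factory=list)
    text: str = ""


def build_query(task: str, detail: str = "") -> str:
    """Express a task as a natural-language prompt (templates are made up)."""
    templates = {
        "classification": "What is this object?",
        "localization": "Locate {detail}.",
        "vqa": "{detail}",
        "captioning": "Describe the image.",
    }
    return templates[task].format(detail=detail)


def gpv_stub(image, query: str) -> GPVOutput:
    """Placeholder for a GPV-style model: same signature for every task.

    A real model would encode the image and the query jointly; this stub
    returns fixed dummy values so the interface can be exercised.
    """
    return GPVOutput(boxes=[(0.0, 0.0, 1.0, 1.0)], relevance=[0.5], text="<dummy>")


# The task is selected only by the text query, never by switching heads.
image = None  # stand-in for an image tensor
for task, detail in [("localization", "person"), ("vqa", "What color is the car?")]:
    out = gpv_stub(image, build_query(task, detail))
```

The caller decides which part of the output to read (boxes for localization, text for the other tasks), which is one way to interpret "output: bounding boxes and/or text".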

During training:

localization: uses bounding box ground truth

classification, VQA, captioning: use text ground truth

The losses are:

text: maximize the log-likelihood of the ground truth text

vision: DETR's Hungarian loss
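The two loss terms above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: the matching cost here is only an L1 box distance (DETR's actual cost also includes other terms), and the optimal one-to-one assignment is found by brute force over permutations instead of the Hungarian algorithm, which is only practical for a handful of boxes. In practice something like `scipy.optimize.linear_sum_assignment` would be used. The function names are hypothetical.

```python
import math
from itertools import permutations


def text_nll(logits_per_step, target_ids):
    """Negative log-likelihood of the ground-truth tokens.

    Minimizing this value is the same as maximizing the log-likelihood
    of the ground-truth text.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total


def l1_box_cost(pred, gt):
    """L1 distance between two boxes given as 4-tuples."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_boxes(pred_boxes, gt_boxes):
    """Minimum-cost one-to-one matching of ground-truth boxes to predictions.

    Brute force stand-in for the Hungarian algorithm; assumes
    len(pred_boxes) >= len(gt_boxes). Returns (assignment, cost), where
    assignment[g] is the index of the prediction matched to ground truth g.
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_box_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


preds = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4)]
gts = [(0.1, 0.1, 0.4, 0.4), (0.0, 0.0, 1.0, 1.0)]
assignment, cost = match_boxes(preds, gts)  # each ground truth gets its own prediction
```

Once the matching is fixed, the box loss is computed only between matched pairs, which is why the order of predicted boxes does not matter.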

Experiments

The existing COCO dataset is re-split into COCO-SCE (skill-concept evaluation)
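A rough sketch of what a skill-concept split could look like, using the person example from earlier in these notes. The sample format, the function name `split_by_skill_concept`, and the held-out sets are hypothetical; the actual COCO-SCE protocol may differ in its details.

```python
def split_by_skill_concept(samples, held_out):
    """Split samples so that some concepts are never seen for some skills.

    samples:  list of dicts with keys 'skill' and 'concepts' (assumed format)
    held_out: dict mapping a skill name to the set of concepts hidden
              from that skill during training
    Returns (train, held_out_test).
    """
    train, held_out_test = [], []
    for sample in samples:
        blocked = held_out.get(sample["skill"], set())
        if blocked & set(sample["concepts"]):
            # Concept is hidden for this skill: keep it for evaluation only.
            held_out_test.append(sample)
        else:
            train.append(sample)
    return train, held_out_test


samples = [
    {"skill": "vqa", "concepts": ["person"]},
    {"skill": "localization", "concepts": ["person"]},
    {"skill": "localization", "concepts": ["dog"]},
]
# 'person' is hidden from localization but still available through VQA.
train, held_out_test = split_by_skill_concept(samples, {"localization": {"person"}})
```

With a split like this, detecting person at test time requires the model to transfer a concept learned through one skill (VQA) to another (localization).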
