Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
Adding a detection head to the existing Mask2Former degrades performance.

Adding a segmentation head to the existing DINO degrades performance.

Goal: build an algorithm that performs well on detection and segmentation at the same time.

Approach: apply Mask2Former concepts to DINO.
Unified query selection for mask:
The classification score of each token serves as the confidence for selecting the top-ranked features, which are fed to the decoder as content queries. The selected features also regress boxes and are dot-multiplied with the high-resolution feature map to predict masks. The predicted boxes and masks are supervised by the ground truth and serve as initial anchors for the decoder.
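A minimal sketch of the selection step described above, in plain Python with nested lists standing in for tensors. The function names `select_top_k_queries` and `mask_logits`, and the choice of the maximum class score as the per-token confidence, are assumptions made for illustration and are not taken from the Mask DINO code.

```python
def select_top_k_queries(class_scores, features, k):
    """Pick the k tokens with the highest classification confidence.

    class_scores: list of per-token score lists (one score per class).
    features:     list of per-token embeddings (same order as class_scores).
    Returns (indices, selected_features).
    Assumption: confidence of a token = its maximum class score.
    """
    confidence = [max(scores) for scores in class_scores]
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    top = order[:k]
    return top, [features[i] for i in top]


def mask_logits(query, pixel_features):
    """Dot product of one query embedding with every pixel embedding.

    pixel_features: H x W x C nested list (the high-resolution feature map).
    Returns an H x W nested list of mask logits.
    """
    return [
        [sum(q * p for q, p in zip(query, pixel)) for pixel in row]
        for row in pixel_features
    ]


# Toy usage: 3 tokens, 2 classes, 2-dimensional embeddings.
scores = [[0.1, 0.2], [0.9, 0.3], [0.5, 0.6]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, content_queries = select_top_k_queries(scores, feats, k=2)
logits = mask_logits(content_queries[0], [[[1.0, 2.0], [3.0, 4.0]]])
```

In a real model the same selected features would also pass through a box regression head; that part is omitted here because the notes do not describe its form.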
Mask-enhanced anchor box initialization:
We derive boxes from the predicted masks as a better anchor box initialization for the decoder.
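One way to derive a box from a mask is to take the tightest axis-aligned rectangle around its foreground pixels. The sketch below assumes a binary mask given as a nested list and an `(x_min, y_min, x_max, y_max)` pixel-index box format; the helper name `mask_to_box` is hypothetical, and the actual thresholding and box format used in Mask DINO may differ.

```python
def mask_to_box(mask):
    """Tightest box (x_min, y_min, x_max, y_max) around nonzero pixels.

    mask: H x W nested list of 0/1 values (assumed already thresholded).
    Returns None when the mask has no foreground pixel.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy usage: an L-shaped blob in a 3 x 4 mask.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
anchor_box = mask_to_box(example_mask)
```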
Unified denoising for mask:
We can treat boxes as a noised version of masks and train the model to predict masks given boxes as a denoising task. The given boxes for mask prediction are also randomly noised for more efficient mask denoising training.
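The random box noising can be sketched as a center shift plus a size rescale, in the spirit of DN-DETR-style denoising. The `(cx, cy, w, h)` box format, the `scale` parameter, its default of 0.4, and the function name `noise_box` are all assumptions for illustration; the notes do not specify the noise distribution.

```python
import random


def noise_box(box, scale=0.4, rng=random):
    """Return a randomly perturbed copy of a (cx, cy, w, h) box.

    The center moves by at most scale * half the box size along each axis,
    and width/height are multiplied by a factor in [1 - scale, 1 + scale].
    Assumption: 0 <= scale < 1, so the noised box keeps a positive size.
    """
    cx, cy, w, h = box
    dx = rng.uniform(-1.0, 1.0) * scale * w / 2.0
    dy = rng.uniform(-1.0, 1.0) * scale * h / 2.0
    new_w = w * (1.0 + rng.uniform(-1.0, 1.0) * scale)
    new_h = h * (1.0 + rng.uniform(-1.0, 1.0) * scale)
    return (cx + dx, cy + dy, new_w, new_h)


# Toy usage with a seeded generator so runs are repeatable.
generator = random.Random(0)
gt_box = (0.5, 0.5, 0.2, 0.4)
noised_box = noise_box(gt_box, scale=0.4, rng=generator)
```

During training, the noised box would be given to the decoder as an anchor and the model would be asked to recover the ground-truth mask for that object.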
Hybrid matching:
