Object detection

Research using DINOv2 as a backbone for object detection:

DINOv2 ❌

DINOv2 ✅

“NVIDIA has also released a foundational model called NV-Dinov2, which is available through the NVIDIA AI Enterprise program. NV-Dinov2 is a visual foundational model trained on an NVIDIA proprietary large scale dataset.” NV-DINOv2

  • NVIDIA provides CLIP ViT and DINO ViT backbones for object detection and segmentation (closed source)

    • This signals that it is not only possible but genuinely useful in production (the TAO Toolkit specifically markets itself as providing enterprise-ready vision transformers)
    • However, the documentation also explicitly acknowledges the inferior performance of plain ViTs compared with networks trained specifically for dense prediction:

      “To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve comparable performance to vision-specific transformers on dense prediction tasks.”
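The root of the dense-prediction mismatch is that a plain ViT emits a flat sequence of patch tokens at a single, coarse scale, while detection and segmentation heads expect 2D (and usually multi-scale) feature maps. A minimal sketch of the first step any such head needs, reshaping patch tokens back into a spatial map, follows; the backbone here is a toy patch embedding standing in for DINOv2 (dimensions mimic ViT-S/14: patch size 14, embed dim 384), not the real model:

```python
import torch
import torch.nn as nn

PATCH = 14       # DINOv2 ViT-S/14 patch size
EMBED_DIM = 384  # ViT-S embed dim

class ToyPatchBackbone(nn.Module):
    """Stand-in for a ViT: patchify an image into a token sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, EMBED_DIM, kernel_size=PATCH, stride=PATCH)

    def forward(self, x):
        feat = self.proj(x)                      # (B, C, H/14, W/14)
        return feat.flatten(2).transpose(1, 2)   # (B, N, C) patch tokens

def tokens_to_map(tokens, h, w):
    """Reshape (B, N, C) patch tokens into a (B, C, h, w) feature map."""
    b, n, c = tokens.shape
    assert n == h * w, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, h, w)

backbone = ToyPatchBackbone()
img = torch.randn(1, 3, 224, 224)     # 224 / 14 = 16x16 patch grid
tokens = backbone(img)                # (1, 256, 384)
fmap = tokens_to_map(tokens, 16, 16)  # (1, 384, 16, 16), ready for a dense head
```

Adapters like ViT-Adapter go further than this reshape: they inject convolutional, multi-scale priors alongside the frozen-or-finetuned ViT so the resulting maps carry the spatial detail dense heads need.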

  • Exploring Plain Vision Transformer Backbones for Object Detection

  • SimPLR - A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation

    • Improves over ViTDet
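For context on what SimPLR improves over: ViTDet's core trick is building a multi-scale feature pyramid from the plain ViT's single 1/16-scale map using only up/down-sampling, with no multi-scale backbone. A sketch of that simple feature pyramid follows; channel dims and strides are illustrative, not ViTDet's exact configuration:

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """ViTDet-style pyramid: derive 1/4, 1/8, 1/16, 1/32 maps from one 1/16 map."""
    def __init__(self, dim=384, out_dim=256):
        super().__init__()
        # 4x upsample via two stride-2 transposed convs
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2),
        )
        self.up2 = nn.ConvTranspose2d(dim, dim // 2, 2, stride=2)  # 2x upsample
        self.down2 = nn.MaxPool2d(2)                               # 2x downsample
        # 1x1 convs to project every level to a common channel count
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_dim, 1) for c in (dim // 4, dim // 2, dim, dim)
        )

    def forward(self, x):
        feats = [self.up4(x), self.up2(x), x, self.down2(x)]
        return [lat(f) for lat, f in zip(self.lateral, feats)]

fpn = SimpleFeaturePyramid()
x = torch.randn(1, 384, 16, 16)   # single-scale plain-ViT output
p2, p3, p4, p5 = fpn(x)           # four scales for a standard detection head
```

SimPLR's pitch is to drop even this pyramid and keep the detector single-scale, which is why it is framed as simpler and more scaling-efficient than ViTDet.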