
DINOv2: A Self-supervised Vision Transformer Model

A family of foundation models producing universal features suitable for image-level visual tasks (image classification, instance retrieval, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation).

The research

DINOv2: Learning Robust Visual Features Without Supervision

A family of models to encode visual features, evaluated across 30 different benchmarks covering 8 types of visual tasks from image classification to monocular depth estimation.
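As a concrete starting point, the sketch below loads a pretrained backbone and extracts a frozen image embedding. It assumes the torch.hub entry points published in the facebookresearch/dinov2 repository; the image path is a placeholder.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Load a pretrained ViT-S/14 backbone via torch.hub (facebookresearch/dinov2)
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model.eval()

    # Standard ImageNet preprocessing; 224 is a multiple of the 14-pixel patch size
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        embedding = model(img)  # (1, 384) global image feature for ViT-S/14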

Demonstrations


Depth Estimation

State-of-the-art results and strong generalization on estimating depth from a single image.
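In the paper, depth is predicted by heads trained on top of the frozen backbone. The sketch below only illustrates the wiring of a per-patch linear head; the head is untrained and purely illustrative. It reuses model and img from the extraction sketch above and assumes the forward_features interface of the dinov2 models.

    import torch
    import torch.nn as nn

    # Illustrative, untrained linear head: one depth value per 14x14 patch
    depth_head = nn.Linear(384, 1)

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"]  # (1, 256, 384)
    coarse_depth = depth_head(tokens).reshape(1, 16, 16)  # 224 / 14 = 16 patches per side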


Semantic Segmentation

Competitive results without any fine-tuning when clustering images into object classes.
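One simple way to reproduce this clustering behaviour is to run k-means over the frozen patch tokens. A hedged sketch, reusing model and img from the extraction sketch above; the cluster count is arbitrary.

    import torch
    from sklearn.cluster import KMeans

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"][0]  # (256, 384)

    # Cluster patch tokens into object-like groups; 5 clusters is arbitrary
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(tokens.numpy())
    segmentation = labels.reshape(16, 16)  # coarse 16x16 class map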


Instance Retrieval

Use frozen features directly to retrieve, from a large art collection, pieces similar to a given image.
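A minimal retrieval sketch: rank a gallery by cosine similarity of frozen embeddings. It reuses model and preprocess from the extraction sketch above; the file names are placeholders.

    import torch
    import torch.nn.functional as F
    from PIL import Image

    def embed(path):
        # Reuses `model` and `preprocess` from the extraction sketch above
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return F.normalize(model(x), dim=-1)

    # Placeholder file names for a gallery and a query image
    gallery_paths = ["art_001.jpg", "art_002.jpg", "art_003.jpg"]
    gallery = torch.cat([embed(p) for p in gallery_paths])  # (N, 384), unit norm
    query = embed("query.jpg")                              # (1, 384)

    scores = (query @ gallery.T).squeeze(0)   # cosine similarities
    ranked = scores.argsort(descending=True)
    print([gallery_paths[int(i)] for i in ranked])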


Dense Matching

DINOv2 patch features can be used to consistently map all parts of an image without supervision.
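A common way to visualize these dense features is to project the patch tokens onto their first three principal components and read the result as a colour-coded part map. A sketch, reusing model and img from the extraction sketch above:

    import torch

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"][0]  # (256, 384)

    # Project each patch token onto the first three principal components and
    # interpret the result as an RGB "part map" of the image
    tokens = tokens - tokens.mean(dim=0)
    _, _, v = torch.pca_lowrank(tokens, q=3)
    rgb = (tokens @ v).reshape(16, 16, 3)
    rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())  # normalize to [0, 1]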


Sparse Matching

Compare DINOv2 patch features across two images to match their most similar parts.
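A hedged sketch of such matching via mutual nearest neighbours between the patch tokens of two images, reusing model and preprocess from the extraction sketch above; the image paths are placeholders.

    import torch
    import torch.nn.functional as F
    from PIL import Image

    def patch_tokens(path):
        # Reuses `model` and `preprocess` from the extraction sketch above
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            t = model.forward_features(x)["x_norm_patchtokens"][0]  # (256, 384)
        return F.normalize(t, dim=-1)

    a = patch_tokens("left.jpg")
    b = patch_tokens("right.jpg")

    sim = a @ b.T                    # cosine similarity between all patch pairs
    ab = sim.argmax(dim=1)           # best match in b for each patch of a
    ba = sim.argmax(dim=0)           # best match in a for each patch of b
    mutual = torch.arange(a.shape[0]) == ba[ab]  # keep mutual nearest neighbours
    matches = torch.nonzero(mutual).squeeze(1)
    pairs = [(int(i), int(ab[i])) for i in matches]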

Data, approach and other results

A large pretraining dataset of 142 million images was assembled and curated from web-crawled data to cover a broad range of key visual domains.

The approach builds on DINO and iBOT, with several adjustments that improve both the quality of the features and the efficiency of pretraining.

Frozen features produced by the models are also evaluated on other visual tasks like coarse and fine-grained visual classification as well as video understanding. The results are thoroughly compared against other self-supervised and weakly supervised alternatives.
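These classification evaluations are typically linear probes over frozen embeddings. A minimal sketch, with random placeholder arrays standing in for embeddings computed as above and for dataset labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder arrays: in practice these would be frozen DINOv2 embeddings
    # (e.g. 384-d for ViT-S/14) and the corresponding dataset labels
    X_train = np.random.randn(1000, 384).astype(np.float32)
    y_train = np.random.randint(0, 10, size=1000)

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(probe.score(X_train, y_train))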

For more details, see the blog post and paper.