
DINOv2: A Self-supervised Vision Transformer Model

A family of foundation models producing universal features suitable for image-level visual tasks (image classification, instance retrieval, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation).

The research

DINOv2: Learning Robust Visual Features Without Supervision

A family of models to encode visual features, evaluated across 30 different benchmarks covering 8 types of visual tasks from image classification to monocular depth estimation.
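As a concrete starting point, the sketch below loads a pretrained backbone and extracts a frozen image embedding. It assumes the torch.hub entry points published in the facebookresearch/dinov2 repository; the image path is a placeholder.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Load a pretrained ViT-S/14 backbone via torch.hub (facebookresearch/dinov2)
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model.eval()

    # Standard ImageNet preprocessing; 224 is a multiple of the 14-pixel patch size
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        embedding = model(img)  # (1, 384) global image feature for ViT-S/14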

Demonstrations


Depth Estimation

State-of-the-art results and strong generalization on estimating depth from a single image.
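In the paper, depth is predicted by heads trained on top of the frozen backbone. The sketch below only illustrates the wiring of a per-patch linear head; the head is untrained and purely illustrative. It reuses model and img from the extraction sketch above and assumes the forward_features interface of the dinov2 models.

    import torch
    import torch.nn as nn

    # Illustrative, untrained linear head: one depth value per 14x14 patch
    depth_head = nn.Linear(384, 1)

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"]  # (1, 256, 384)
    coarse_depth = depth_head(tokens).reshape(1, 16, 16)  # 224 / 14 = 16 patches per side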


Semantic Segmentation

Competitive results without any fine-tuning when clustering images into object classes.
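One simple way to reproduce this clustering behaviour is to run k-means over the frozen patch tokens. A hedged sketch, reusing model and img from the extraction sketch above; the cluster count is arbitrary.

    import torch
    from sklearn.cluster import KMeans

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"][0]  # (256, 384)

    # Cluster patch tokens into object-like groups; 5 clusters is arbitrary
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(tokens.numpy())
    segmentation = labels.reshape(16, 16)  # coarse 16x16 class map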


Instance Retrieval

Use frozen features directly to retrieve, from a large art collection, pieces similar to a given image.
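A minimal retrieval sketch: rank a gallery by cosine similarity of frozen embeddings. It reuses model and preprocess from the extraction sketch above; the file names are placeholders.

    import torch
    import torch.nn.functional as F
    from PIL import Image

    def embed(path):
        # Reuses `model` and `preprocess` from the extraction sketch above
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return F.normalize(model(x), dim=-1)

    # Placeholder file names for a gallery and a query image
    gallery_paths = ["art_001.jpg", "art_002.jpg", "art_003.jpg"]
    gallery = torch.cat([embed(p) for p in gallery_paths])  # (N, 384), unit norm
    query = embed("query.jpg")                              # (1, 384)

    scores = (query @ gallery.T).squeeze(0)   # cosine similarities
    ranked = scores.argsort(descending=True)
    print([gallery_paths[int(i)] for i in ranked])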


Dense Matching

DINOv2 patch features can be used to consistently map all parts of an image without supervision.
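A common way to visualize these dense features is to project the patch tokens onto their first three principal components and read the result as a colour-coded part map. A sketch, reusing model and img from the extraction sketch above:

    import torch

    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"][0]  # (256, 384)

    # Project each patch token onto the first three principal components and
    # interpret the result as an RGB "part map" of the image
    tokens = tokens - tokens.mean(dim=0)
    _, _, v = torch.pca_lowrank(tokens, q=3)
    rgb = (tokens @ v).reshape(16, 16, 3)
    rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())  # normalize to [0, 1]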


Sparse Matching

Compare DINOv2 patch features across two images to match their most similar parts.
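A hedged sketch of such matching via mutual nearest neighbours between the patch tokens of two images, reusing model and preprocess from the extraction sketch above; the image paths are placeholders.

    import torch
    import torch.nn.functional as F
    from PIL import Image

    def patch_tokens(path):
        # Reuses `model` and `preprocess` from the extraction sketch above
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            t = model.forward_features(x)["x_norm_patchtokens"][0]  # (256, 384)
        return F.normalize(t, dim=-1)

    a = patch_tokens("left.jpg")
    b = patch_tokens("right.jpg")

    sim = a @ b.T                    # cosine similarity between all patch pairs
    ab = sim.argmax(dim=1)           # best match in b for each patch of a
    ba = sim.argmax(dim=0)           # best match in a for each patch of b
    mutual = torch.arange(a.shape[0]) == ba[ab]  # keep mutual nearest neighbours
    matches = torch.nonzero(mutual).squeeze(1)
    pairs = [(int(i), int(ab[i])) for i in matches]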

Data, approach and other results

A large pretraining dataset of 142 million images was assembled and curated from web-crawled data to cover a broad range of key visual domains.

The approach builds on DINO and iBOT, with several adjustments that improve both the quality of the features and the efficiency of pretraining.

Frozen features produced by the models are also evaluated on other visual tasks like coarse and fine-grained visual classification as well as video understanding. The results are thoroughly compared against other self-supervised and weakly supervised alternatives.
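These classification evaluations are typically linear probes over frozen embeddings. A minimal sketch, with random placeholder arrays standing in for embeddings computed as above and for dataset labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder arrays: in practice these would be frozen DINOv2 embeddings
    # (e.g. 384-d for ViT-S/14) and the corresponding dataset labels
    X_train = np.random.randn(1000, 384).astype(np.float32)
    y_train = np.random.randint(0, 10, size=1000)

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(probe.score(X_train, y_train))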

For more details, see the blog post and paper.