The research
DINOv2 is a family of models that encode visual features, evaluated across 30 different benchmarks covering 8 types of visual tasks, from image classification to monocular depth estimation.
The DINOv2 family of models drastically improves over the previous state of the art in self-supervised learning (SSL), and reaches performance comparable with weakly-supervised features (WSL).
DINOv2 models are pretrained without supervision on a large, curated and diverse dataset of 142 million images.
DINOv2 models demonstrate strong out-of-distribution performance, and the features they produce can be used directly, without any fine-tuning.
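As a minimal sketch of this frozen-feature workflow: the snippet below loads a pretrained backbone through torch.hub and extracts a global image embedding with no fine-tuning. The entry-point name (dinov2_vits14) follows the public facebookresearch/dinov2 repository, the preprocessing is a standard ImageNet-style transform, and the image path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a pretrained DINOv2 backbone via torch.hub; the entry-point name
# follows the facebookresearch/dinov2 repository (larger variants such as
# dinov2_vitb14 / dinov2_vitl14 / dinov2_vitg14 are also published there).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Standard ImageNet-style preprocessing; side lengths must be multiples
# of the 14-pixel patch size.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

# No gradients, no fine-tuning: the backbone stays frozen.
with torch.no_grad():
    features = model(image)  # (1, embed_dim) global image embedding

print(features.shape)
```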
State-of-the-art results and strong generalization on estimating depth from a single image.
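The depth results are obtained with probes built on frozen patch tokens. Below is a deliberately simplified sketch of that wiring, assuming a single linear regression head: LinearDepthHead is a name introduced here, and the actual probes are richer (e.g., depth binning over features from multiple layers, or a DPT-style decoder).

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

# Minimal stand-in for a depth probe: one linear layer per patch token,
# upsampled back to pixel resolution. This only illustrates the plumbing,
# not the paper's full evaluation heads.
class LinearDepthHead(nn.Module):
    def __init__(self, embed_dim=384, patch=14):
        super().__init__()
        self.patch = patch
        self.to_depth = nn.Linear(embed_dim, 1)

    def forward(self, patch_tokens, img_size):
        h, w = img_size[0] // self.patch, img_size[1] // self.patch
        depth = self.to_depth(patch_tokens)                 # (B, h*w, 1)
        depth = depth.transpose(1, 2).reshape(-1, 1, h, w)  # (B, 1, h, w)
        return nn.functional.interpolate(depth, size=img_size, mode="bilinear")

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    # Output key name follows the public dinov2 repository.
    tokens = backbone.forward_features(x)["x_norm_patchtokens"]

depth_map = LinearDepthHead()(tokens, (224, 224))
print(depth_map.shape)  # (1, 1, 224, 224)
```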
Competitive results without any fine-tuning on clustering images into object classes.
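As a purely illustrative stand-in for this capability, the sketch below clusters frozen patch tokens with k-means, which already groups patches into object-like regions with no labels at all. K-means is swapped in here for simplicity and is not the evaluation protocol behind the reported numbers; the random tensor stands in for a preprocessed image.

```python
import torch
from sklearn.cluster import KMeans

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    tokens = model.forward_features(x)["x_norm_patchtokens"][0]  # (256, 384)

# Cluster patch embeddings; each cluster tends to group patches belonging
# to the same object or region, with no labels or fine-tuning involved.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(tokens.numpy())
segmentation = labels.reshape(16, 16)  # coarse 16x16 segment map
print(segmentation)
```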
Directly use frozen features to find art pieces similar to a given image from a large art collection.
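A minimal sketch of such retrieval with frozen features: embed the gallery and the query, then rank by cosine similarity. The gallery and query paths are hypothetical placeholders; model loading and preprocessing follow the public repository as above.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
preprocess = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embed(paths):
    """L2-normalized frozen embeddings, one row per image."""
    with torch.no_grad():
        feats = torch.cat([model(preprocess(Image.open(p).convert("RGB"))
                                 .unsqueeze(0)) for p in paths])
    return F.normalize(feats, dim=-1)

# Hypothetical art collection and query image; swap in real paths.
gallery_paths = ["art/piece_001.jpg", "art/piece_002.jpg", "art/piece_003.jpg"]
gallery = embed(gallery_paths)
query = embed(["query.jpg"])

# On unit vectors, cosine similarity is a dot product; the top scores
# index the most similar pieces in the collection.
best = (query @ gallery.T).topk(k=2).indices[0]
print([gallery_paths[i] for i in best])
```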
DINOv2 patch features can be used to consistently map all parts of an image without supervision.
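To sketch the idea: extract per-patch tokens for two images and match every patch in one image to its most similar patch in the other via cosine similarity. The forward_features output key follows the public dinov2 repository; random tensors stand in for real preprocessed images of related scenes.

```python
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

# Two stand-in inputs; in practice these would be preprocessed photos.
xa, xb = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
with torch.no_grad():
    ta = model.forward_features(xa)["x_norm_patchtokens"][0]  # (256, 384)
    tb = model.forward_features(xb)["x_norm_patchtokens"][0]

# Cosine similarity between every patch pair; the argmax along each row
# assigns each patch of image A its best match in image B, which is the
# part-to-part correspondence dense matching builds on.
sim = F.normalize(ta, dim=-1) @ F.normalize(tb, dim=-1).T  # (256, 256)
matches = sim.argmax(dim=-1)  # index of the matching patch in image B
print(matches[:10])
```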
A large pretraining dataset of 142 million images was assembled and curated from web-crawled data to provide good coverage of a number of key visual domains.
The approach builds upon DINO and iBOT with several adjustments that improve both the quality of the features and the efficiency of pretraining.
Frozen features produced by the models are also evaluated on other visual tasks, such as coarse and fine-grained image classification and video understanding. The results are thoroughly compared against other self-supervised and weakly-supervised alternatives.
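The standard protocol behind such comparisons trains only a lightweight classifier on top of the frozen features (a linear probe). A minimal sketch, assuming precomputed embeddings: the random arrays below stand in for real DINOv2 features and class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are frozen DINOv2 embeddings (e.g., from the
# extraction sketch above), labels are class indices.
rng = np.random.default_rng(0)
train_feats, test_feats = rng.normal(size=(100, 384)), rng.normal(size=(20, 384))
train_labels, test_labels = rng.integers(0, 5, 100), rng.integers(0, 5, 20)

# Linear probe: the backbone stays frozen, only this classifier is trained.
clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
print("linear-probe accuracy:", clf.score(test_feats, test_labels))
```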