How to Train an Image Classification Model with DINOv3 on a Custom Dataset

Foundation models are revolutionizing computer vision, and Meta's DINOv3 stands out as a particularly powerful vision backbone. If you're looking to achieve state-of-the-art performance on your custom image classification tasks, DINOv3 is an excellent choice.
In this guide, we'll walk through a complete workflow for training an image classification model using DINOv3. We'll cover everything from data preparation to model training and evaluation, culminating in a model that achieves an impressive 96% accuracy on a challenging 200-class dataset, a full 10% improvement over its predecessor, DINOv2.
The Training Strategy: Efficient Fine-Tuning
The key to leveraging large, pre-trained models like DINOv3 without enormous computational cost is an efficient fine-tuning strategy. Instead of retraining the entire network, we will:
- Freeze the Backbone: We'll keep the weights of the powerful, pre-trained DINOv3 model frozen. This backbone is already an expert at extracting meaningful features from images.
- Add a Linear Head: We'll attach a simple, lightweight linear classification layer to the end of the backbone.
- Train Only the Head: The training process will focus exclusively on this new linear layer, teaching it to map the features extracted by DINOv3 to our specific classes.
This approach is not only computationally efficient but also highly effective at preventing overfitting, especially with smaller datasets.
Step 1: Setting Up Your Project
A clean and reproducible setup is crucial for any machine learning project.
Dataset Preparation
For this tutorial, we will use the Birds 200 Species Classification dataset available on Kaggle. This is a great benchmark due to its high number of classes and fine-grained distinctions.
The dataset follows a standard and convenient structure:
images/
├── ABBOTTS BABBLER/
│   ├── 001.jpg
│   ├── 002.jpg
│   └── ...
├── AFRICAN CROWNED CRANE/
│   ├── 001.jpg
│   └── ...
└── ...
Each subdirectory within the main images folder is named after a bird species and contains all the images for that class. This format is easily readable by most data-loading utilities in deep learning frameworks. See the accompanying notebook for the complete code.
Environment
- To avoid dependency conflicts, it is highly recommended to use a virtual environment or a containerized solution like Docker. Ensure you have PyTorch, torchvision, and other common data science libraries installed.
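Before going further, a quick sanity check (a minimal sketch, with no particular versions implied) confirms that the core libraries import cleanly and reports whether a GPU is visible:
import torch
import torchvision
import transformers

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())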
Step 2: The Training Workflow
With the data in place, we can move on to the core of the training process.
Data Loading and Preprocessing
- Load the Dataset: Using a utility like PyTorch's ImageFolder class, you can load the dataset with a single line of code, pointing it to your images directory. It will automatically infer the class labels from the folder names.
- Split the Data: Divide your dataset into training and validation sets. An 80/20 split is a standard starting point. The validation set is crucial for monitoring the model's performance on unseen data and preventing overfitting.
- Create Label Mappings: Machine learning models require numerical inputs. Create two dictionaries (or mappings): one to convert the string-based class names (e.g., "ABBOTTS BABBLER") into unique integers, and another to map them back for interpreting the results.
from torch.utils.data import random_split
from torchvision import datasets

data_dir = "./downloads/birds-200-species/CUB_200_2011/images"
full_dataset = datasets.ImageFolder(root=data_dir)

# 80/20 train/validation split
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

# Mappings between integer ids and class-name strings
num_classes = len(full_dataset.classes)
id2label = {i: c for i, c in enumerate(full_dataset.classes)}
label2id = {c: i for i, c in id2label.items()}
Building the Model
- Load DINOv3 Weights: Download the pre-trained DINOv3 model weights. Note that DINOv3 weights are gated, so you'll need to agree to Meta's licensing terms to access them (see the download sketch below this list). For this task, the "base" model provides an excellent trade-off between performance and speed.
- Construct the Custom Classifier: Create a custom model class that encapsulates the logic. This class should contain:
- The frozen DINOv3 backbone.
- The new, trainable linear classification head. The number of output neurons in this layer must match the number of classes in your dataset (200 in our case).
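How you obtain the gated weights depends on your setup; one option is a huggingface_hub download after accepting the license on the model page. The sketch below assumes a repo id chosen to match the local path used in this tutorial:
from huggingface_hub import login, snapshot_download

# Log in with a token from an account that has accepted the DINOv3 license terms
login()

# Repo id is an assumption; mirror the weights into the local path used below
snapshot_download(
    repo_id="facebook/dinov3-vitb16-pretrain-lvd1689m",
    local_dir="./downloads/dinov3-vitb16-pretrain-lvd1689m",
)
With the weights available locally, we can build the model: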
import json

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoImageProcessor, AutoModel

MODEL_NAME = "./downloads/dinov3-vitb16-pretrain-lvd1689m"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
backbone = AutoModel.from_pretrained(MODEL_NAME)

# Keep the configs around so they can be embedded in checkpoints later
image_processor_config = json.loads(image_processor.to_json_string())
backbone_config = json.loads(AutoConfig.from_pretrained(MODEL_NAME).to_json_string())

class DinoV3Linear(nn.Module):
    def __init__(self, backbone: AutoModel, num_classes: int, freeze_backbone: bool = True):
        super().__init__()
        self.backbone = backbone
        if freeze_backbone:
            # Freeze every backbone parameter; only the head will be trained
            for p in self.backbone.parameters():
                p.requires_grad = False
            self.backbone.eval()
        hidden_size = getattr(backbone.config, "hidden_size", None)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        # Use the CLS token embedding as the image-level feature
        last_hidden = outputs.last_hidden_state
        cls = last_hidden[:, 0]
        logits = self.head(cls)
        return logits

freeze_backbone = True
model = DinoV3Linear(backbone, num_classes, freeze_backbone=freeze_backbone).to(device)
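As a quick, optional sanity check, you can confirm that only the linear head's parameters are trainable:
# With the backbone frozen, only the head should require gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")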
The Training Loop
The training loop is where the model learns. Here are the key components:
- Configuration: Define your optimizer (e.g., AdamW), loss function (Cross-Entropy Loss is standard for multi-class classification), and a learning rate scheduler.
- Batch Iteration: Process the data in batches to manage memory usage and improve training stability (a DataLoader sketch follows this list).
- Forward and Backward Pass: For each batch, pass the images through the model, calculate the loss, and perform backpropagation to update the weights of the linear head.
- Model Checkpointing: After each epoch, run the model on the validation set to check its accuracy. Save only the model weights that achieve the best validation accuracy so far. This ensures that you end up with the most performant version of your model, and it also saves significant disk space.
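The DataLoader construction isn't shown in the snippets here; a minimal sketch follows, assuming the image_processor loaded earlier (the collate_fn name and the BATCH_SIZE value are illustrative choices). It turns ImageFolder's (PIL image, label) pairs into the batches the loop below expects:
from torch.utils.data import DataLoader

BATCH_SIZE = 32  # illustrative; tune for your GPU memory

def collate_fn(batch):
    # ImageFolder yields (PIL image, integer label) pairs; the processor
    # handles resizing, normalization, and tensor conversion.
    images, labels = zip(*batch)
    inputs = image_processor(images=list(images), return_tensors="pt")
    return {"pixel_values": inputs["pixel_values"], "labels": torch.tensor(labels)}

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
With the loaders in place, the loop itself looks like this: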
import os

import trackio
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters (illustrative values; the original notebook defines its own)
EPOCHS = 10
LR = 1e-3
WEIGHT_DECAY = 1e-4
WARMUP_RATIO = 0.1
ckpt_path = "./weights/model_best.pt"
os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)

# Only the head's parameters require gradients, so only they are optimized
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=LR, weight_decay=WEIGHT_DECAY)
total_steps = EPOCHS * len(train_loader)  # batches per epoch times epochs
warmup_steps = int(WARMUP_RATIO * total_steps)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

best_acc = 0.0
global_step = 0

trackio.init(project="dinov3", config={
    "epochs": EPOCHS,
    "learning_rate": LR,
    "batch_size": BATCH_SIZE,
})

for epoch in range(1, EPOCHS + 1):
    model.train()
    model.backbone.eval()  # keep the frozen backbone in eval mode
    running_loss = 0.0
    for i, batch in enumerate(train_loader, start=1):
        pixel_values = batch["pixel_values"].to(device, non_blocking=True)
        labels = batch["labels"].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        # Mixed-precision forward pass; effectively a no-op on CPU-only machines
        with torch.autocast(device_type=device.type, enabled=torch.cuda.is_available()):
            logits = model(pixel_values)
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        running_loss += loss.item()
        global_step += 1

    # Validation pass (a sketch of the step the prose above describes):
    # measure accuracy on the held-out set after each epoch
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in val_loader:
            pixel_values = batch["pixel_values"].to(device, non_blocking=True)
            labels = batch["labels"].to(device, non_blocking=True)
            preds = model(pixel_values).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    val_acc = correct / total
    trackio.log({"epoch": epoch, "train/loss": running_loss / len(train_loader), "val/acc": val_acc})

    # Checkpoint only when validation accuracy improves
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(
            {
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "scheduler_state_dict": scheduler.state_dict(),
                "config": {
                    "model_name": MODEL_NAME,
                    "classes": full_dataset.classes,
                    "backbone": backbone_config,
                    "image_processor": image_processor_config,
                    "freeze_backbone": freeze_backbone,
                },
                "step": global_step,
                "epoch": epoch,
            },
            ckpt_path,
        )
Step 3: Inference and Performance Evaluation
Once training is complete, it's time to see how well your model performs.
- Load the Best Model: Load the saved checkpoint that corresponds to the highest validation accuracy.
- Make Predictions: Write a function to process a single image and feed it through the trained model. The output will be a set of scores for each class; the class with the highest score is the model's prediction.
- Evaluate Performance: On the test set, the model achieved approximately 96% accuracy, a result that showcases the power of DINOv3's feature extraction. To gain further insight, it's helpful to build a visualization utility that displays images alongside their true labels and the model's predictions; this can reveal the kinds of errors your model makes (a sketch follows the inference code below).
import glob
import os

import numpy as np
import torch
import transformers
from PIL import Image

ckpt_path = "./weights/model_best.pt"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt = torch.load(ckpt_path, map_location=device)

# Rebuild the image processor and backbone from the configs stored in the checkpoint
ProcessorClass = getattr(transformers, ckpt["config"]["image_processor"]["image_processor_type"])
image_processor = ProcessorClass(**ckpt["config"]["image_processor"])
backbone = transformers.AutoModel.from_config(transformers.AutoConfig.for_model(**ckpt["config"]["backbone"]))

model = DinoV3Linear(
    backbone=backbone,
    num_classes=len(ckpt["config"]["classes"]),
    freeze_backbone=ckpt["config"].get("freeze_backbone", True),
).to(device)
classes = ckpt["config"]["classes"]
model.load_state_dict(ckpt["model_state_dict"])
model = model.eval()

def infer(image, device):
    with torch.no_grad():
        inputs = image_processor(images=image, return_tensors="pt").to(device)
        logits = model(inputs["pixel_values"])
        probs = torch.softmax(logits, dim=-1)
        pred = probs.argmax(dim=-1).item()
        conf = probs[0, pred].item()
        pred_class = classes[pred]
    return pred_class, conf

# Pick a random image from the dataset directory (data_dir as defined earlier)
images = glob.glob(os.path.join(data_dir, "*", "*.jpg"))
image_path = np.random.choice(images)
image = Image.open(image_path).convert("RGB")
pred, conf = infer(image, device)
print(f"Predicted: {pred}, Conf: {conf:.3f}")
display(image.resize((224, 224)))  # display() is available in notebooks
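For the visualization utility mentioned above, a minimal matplotlib sketch might look like the following (the show_predictions name is illustrative; it assumes the val_dataset, classes, infer, and device defined earlier):
import matplotlib.pyplot as plt

def show_predictions(dataset, n=8):
    # Sample n validation images and compare true labels to predictions
    fig, axes = plt.subplots(2, 4, figsize=(14, 7))
    idxs = np.random.choice(len(dataset), size=n, replace=False)
    for ax, idx in zip(axes.flat, idxs):
        image, label = dataset[int(idx)]  # ImageFolder yields (PIL image, label)
        image = image.convert("RGB")
        pred, conf = infer(image, device)
        ax.imshow(image)
        ax.set_title(f"true: {classes[label]}\npred: {pred} ({conf:.2f})", fontsize=8)
        ax.axis("off")
    plt.tight_layout()
    plt.show()

show_predictions(val_dataset, n=8)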

Conclusion
By combining the powerful feature extraction of Meta's DINOv3 with an efficient fine-tuning strategy, you can build highly accurate image classifiers for your own custom datasets. This method of freezing the backbone and training a simple linear head is a robust, resource-friendly technique that delivers state-of-the-art results. The principles outlined here can be applied to a wide range of computer vision problems, opening the door to new and exciting applications. See the accompanying repository for the complete code.