Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically overlook the fine-grained semantic concepts that truly define an object.
We introduce Concept-Guided Fine-Tuning (CFT), a framework that optimizes a model's internal relevance maps to align with spatially-grounded concept masks. These masks are generated automatically using an LLM to propose concepts and a Vision-Language Model (GroundingSAM) to segment them.
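As a rough sketch of the mask-generation stage (the function, concept names, and the decision to union per-concept masks into one supervision target are assumptions for illustration, not the paper's actual API), the per-concept masks produced by the grounding model might be combined like this:

```python
import numpy as np

def merge_concept_masks(masks):
    """Union per-concept binary masks into one spatial target mask.

    `masks` is a list of HxW {0,1} arrays, one per grounded concept
    (e.g. a "head" mask and a "tail" mask for an animal class). The
    union marks every pixel covered by at least one object-defining
    concept, giving a single target for relevance supervision.
    """
    out = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        out |= m.astype(bool)
    return out.astype(np.float32)

# Toy 4x4 example with two overlapping concept masks.
head = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
tail = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]])
target = merge_concept_masks([head, tail])
print(int(target.sum()))  # 7 pixels covered (4 + 4 - 1 overlapping)
```

Whether CFT merges the concept masks or supervises each concept separately is not stated here; the union is simply the simplest single-target variant.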
CFT requires as few as three images per class and significantly enhances robustness across five out-of-distribution benchmarks (ImageNet-A, ObjectNet, etc.) while maintaining in-distribution accuracy.
The CFT pipeline consists of three automated steps: (1) an LLM proposes fine-grained concepts for each class; (2) a vision-language grounding model (GroundingSAM) segments those concepts into spatially-grounded masks; (3) the ViT is fine-tuned so that its internal relevance maps align with the concept masks.
CFT corrects misclassifications by shifting the model's focus. For example, a "common newt" misclassified as a "scorpion" due to background texture is corrected when the model focuses on the animal's body.
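The exact CFT objective is not given in this summary; as a hypothetical illustration of relevance-mask alignment, one could penalize the fraction of the model's relevance mass that falls outside the concept mask (the function name and normalization scheme below are assumptions, not the paper's loss):

```python
import numpy as np

def relevance_alignment_loss(relevance, mask, eps=1e-8):
    """Fraction of relevance mass falling outside the concept mask.

    `relevance`: HxW non-negative relevance map for an input image;
    `mask`: HxW {0,1} concept mask. Normalizing the relevance map to
    sum to 1 makes the loss the share of relevance assigned to
    background pixels, so minimizing it shifts the model's focus onto
    the object (e.g. the newt's body rather than the background).
    """
    r = relevance / (relevance.sum() + eps)
    return float((r * (1.0 - mask)).sum())

# Toy example: relevance entirely on the mask vs. half on background.
mask   = np.array([[1.0, 0.0], [0.0, 0.0]])
on_obj = np.array([[2.0, 0.0], [0.0, 0.0]])
split  = np.array([[1.0, 1.0], [0.0, 0.0]])
print(relevance_alignment_loss(on_obj, mask))  # ~0.0
print(relevance_alignment_loss(split, mask))   # ~0.5
```

In practice such a term would be added to the standard classification loss with a weighting coefficient, so accuracy on in-distribution data is retained while attention is steered.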
@inproceedings{elisha2026concept,
author = {Elisha, Yehonatan and Barkan, Oren and Koenigstein, Noam},
title = {Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}