Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically overlook the fine-grained semantic concepts that truly define an object.
We introduce Concept-Guided Fine-Tuning (CFT), a framework that improves the robustness of vision models on out-of-distribution (OOD) samples by optimizing the model's internal relevance maps to align with spatially grounded concept masks. These concept masks are generated automatically using recent vision-language models (VLMs) for concept discovery and grounding.
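To make the alignment objective concrete, the sketch below shows one simple way such a relevance-to-mask alignment term could be combined with the standard task loss during fine-tuning. It is a minimal illustration, not the paper's implementation: the function names (`alignment_loss`, `total_loss`), the particular penalty (relevance mass falling outside the concept region), and the assumption that per-image relevance maps are computed separately (e.g., via attention rollout) are all ours.

```python
# Illustrative sketch (not the paper's implementation): one way to align a model's
# per-image relevance maps with binary concept masks during fine-tuning.
# Assumes relevance maps of shape (B, H, W) are computed separately, e.g. via attention rollout.
import torch
import torch.nn.functional as F


def alignment_loss(relevance_maps: torch.Tensor, concept_masks: torch.Tensor) -> torch.Tensor:
    """Penalize relevance that falls outside the concept masks.

    relevance_maps: (B, H, W) non-negative relevance scores.
    concept_masks:  (B, H, W) binary masks marking concept regions.
    """
    # Normalize each relevance map to sum to 1 so the penalty is scale-invariant.
    rel = relevance_maps.flatten(1)
    rel = rel / (rel.sum(dim=1, keepdim=True) + 1e-8)
    mask = concept_masks.flatten(1).float()
    # Fraction of relevance mass outside the concept region (lower is better).
    outside = (rel * (1.0 - mask)).sum(dim=1)
    return outside.mean()


def total_loss(logits, labels, relevance_maps, concept_masks, lam: float = 1.0):
    # Task loss preserves in-distribution accuracy; the alignment term steers
    # relevance toward the concept regions.
    return F.cross_entropy(logits, labels) + lam * alignment_loss(relevance_maps, concept_masks)
```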
CFT requires only a minimal set of images (3 per class) and significantly enhances robustness across five OOD benchmarks while maintaining in-distribution accuracy.