Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically overlook the fine-grained semantic concepts that truly define an object.
We introduce Concept-Guided Fine-Tuning (CFT), a framework that improves the robustness of vision models on out-of-distribution (OOD) samples by optimizing the model's internal relevance maps to align with spatially grounded concept masks. These concept masks are generated automatically using recent vision-language models (VLMs) for concept discovery and grounding.
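To make the alignment objective concrete, the sketch below shows one simple way such a relevance-to-mask alignment term could be combined with the standard task loss during fine-tuning. It is a minimal illustration, not the paper's implementation: the function names (`alignment_loss`, `total_loss`), the particular penalty (relevance mass falling outside the concept region), and the assumption that per-image relevance maps are computed separately (e.g., via attention rollout) are all ours.

```python
# Illustrative sketch (not the paper's implementation): one way to align a model's
# per-image relevance maps with binary concept masks during fine-tuning.
# Assumes relevance maps of shape (B, H, W) are computed separately, e.g. via attention rollout.
import torch
import torch.nn.functional as F


def alignment_loss(relevance_maps: torch.Tensor, concept_masks: torch.Tensor) -> torch.Tensor:
    """Penalize relevance that falls outside the concept masks.

    relevance_maps: (B, H, W) non-negative relevance scores.
    concept_masks:  (B, H, W) binary masks marking concept regions.
    """
    # Normalize each relevance map to sum to 1 so the penalty is scale-invariant.
    rel = relevance_maps.flatten(1)
    rel = rel / (rel.sum(dim=1, keepdim=True) + 1e-8)
    mask = concept_masks.flatten(1).float()
    # Fraction of relevance mass outside the concept region (lower is better).
    outside = (rel * (1.0 - mask)).sum(dim=1)
    return outside.mean()


def total_loss(logits, labels, relevance_maps, concept_masks, lam: float = 1.0):
    # Task loss preserves in-distribution accuracy; the alignment term steers
    # relevance toward the concept regions.
    return F.cross_entropy(logits, labels) + lam * alignment_loss(relevance_maps, concept_masks)
```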
CFT requires only a minimal set of images (3 per class) and significantly enhances robustness across five OOD benchmarks while maintaining in-distribution accuracy.