Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations

1Tel Aviv University, 2The Open University

CVPR 2026

CFT Motivation: Original vs CFT Relevance Maps

Relevance maps in standard ViTs often concentrate on spurious background cues (middle). CFT (right) steers the model toward class-relevant, discriminative concepts like the beak and wings of a bird or the fins of a fish, significantly improving semantic alignment and robustness.

Abstract

Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically overlook the fine-grained semantic concepts that truly define an object.

We introduce Concept-Guided Fine-Tuning (CFT), a framework that optimizes a model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically: an LLM proposes discriminative concepts, and a Vision-Language Model (GroundingSAM) segments them.

CFT requires only a minimal set of images (3 per class) and significantly enhances robustness across five out-of-distribution benchmarks (including ImageNet-A and ObjectNet) while maintaining in-distribution accuracy.

How it Works

The CFT pipeline consists of three automated steps:

  • Concept Discovery: An LLM (GPT-4o-mini) proposes class-discriminative attributes (e.g., "long beak" for a bird).
  • Spatial Grounding: GroundingSAM localizes these concepts in images to create dynamic guidance masks.
  • Alignment: The model's AttnLRP relevance maps are optimized to match these masks, suppressing focus on the background.
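The alignment step above can be illustrated with a toy objective. This is a minimal sketch, not the paper's actual AttnLRP-based loss: it assumes a non-negative relevance map and a binary concept mask, normalizes the relevance into a distribution, and penalizes the relevance mass that falls outside the mask (i.e., on the background). The function name and formulation are hypothetical.

```python
import numpy as np

def concept_alignment_loss(relevance, mask, eps=1e-8):
    """Toy stand-in for CFT's alignment objective: normalize the
    relevance map to a distribution and penalize the mass that
    falls outside the concept mask (background focus)."""
    r = np.clip(relevance, 0.0, None)
    r = r / (r.sum() + eps)
    return float(r[mask == 0].sum())  # relevance mass on background

# A 4x4 relevance map that mostly fires on the background (top-left)
relevance = np.array([[5., 5., 0., 0.],
                      [5., 5., 0., 0.],
                      [0., 0., 1., 1.],
                      [0., 0., 1., 1.]])

# Concept mask marking the object region (bottom-right quadrant)
mask = np.zeros((4, 4))
mask[2:, 2:] = 1

loss = concept_alignment_loss(relevance, mask)  # high: ~0.833 of the mass is off-object
```

Minimizing such a term during fine-tuning pushes relevance toward the grounded concept regions; the paper's method applies this idea to AttnLRP relevance maps with dynamic masks from GroundingSAM.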

Qualitative Results

Qualitative Corrections

CFT corrects misclassifications by shifting the model's focus onto the object itself. For example, a "common newt" misclassified as a "scorpion" due to background texture is correctly classified once the model attends to the animal's body.

BibTeX

@inproceedings{elisha2026concept,
  author    = {Elisha, Yehonatan and Barkan, Oren and Koenigstein, Noam},
  title     = {Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}